FP-growth

Find Frequent Item Sets with the FP-growth Algorithm

Note: This documentation refers to FP-growth version 6.17 and may not be compatible with other versions.
Call fpgrowth without any options or arguments to check which options are actually supported.

Introduction

FP-growth is a program for frequent item set mining, a data mining method that was originally developed for market basket analysis. Frequent item set mining aims at finding regularities in the shopping behavior of the customers of supermarkets, mail-order companies and online shops. In particular, it tries to identify sets of products that are frequently bought together. Once identified, such sets of associated products may be used to optimize the organization of the offered products on the shelves of a supermarket or the pages of a mail-order catalog or web shop, may give hints as to which products may conveniently be bundled, or may be used to suggest other products to customers. However, frequent item set mining may be used for a much wider variety of tasks, which have in common that one is interested in finding regularities between (nominal) variables in a given data set. For an overview of frequent item set mining in general and several specific algorithms (including FP-growth), see the survey [Borgelt 2012].

This page describes the FP-growth implementation that I have been developing and improving since 2004. This implementation covers the basic scheme as developed in [Han et al. 2000], which introduced the FP-tree as the core data structure. It also supports filtering for closed and maximal item sets with conditional item set repositories as suggested in [Grahne and Zhu 2003], although the approach used in the program differs insofar as it uses top-down prefix trees rather than FP-trees. It does not cover the clever implementation of FP-trees with two integer arrays as suggested in [Rasz 2004].
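
For readers who want a concrete picture of the core data structure, the following Python sketch builds an FP-tree in the spirit of [Han et al. 2000]: item frequencies are counted, infrequent items are discarded, the remaining items of each transaction are sorted by descending frequency, and the transaction is inserted into a prefix tree whose nodes carry counters. This is only a conceptual illustration, not the (far more optimized) C implementation described on this page; all names are made up.

from collections import Counter

class FPNode:
    """Node of an FP-tree: item, counter, parent link and children."""
    def __init__(self, item, parent):
        self.item = item          # item stored in this node (None for the root)
        self.count = 0            # number of transactions passing through this node
        self.parent = parent      # parent node (towards the root)
        self.children = {}        # item -> child node

def build_fptree(transactions, min_supp_abs):
    """Build an FP-tree from a list of transactions (lists or sets of items)."""
    freq = Counter(item for t in transactions for item in set(t))
    # keep only frequent items and order them by descending frequency
    order = {it: i for i, (it, n) in enumerate(freq.most_common()) if n >= min_supp_abs}
    root, links = FPNode(None, None), {}      # links: header table, item -> list of nodes
    for t in transactions:
        items = sorted((it for it in set(t) if it in order), key=order.get)
        node = root
        for it in items:                      # insert the filtered, reordered transaction
            child = node.children.get(it)
            if child is None:
                child = FPNode(it, node)
                node.children[it] = child
                links.setdefault(it, []).append(child)
            child.count += 1
            node = child
    return root, links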

Note that the current version of this program can only find frequent item sets, not association rules.

Enjoy,
Christian Borgelt


Basic Notions

This section briefly introduces some basic notions needed to talk about frequent item sets. These notions are item, transaction, support and confidence as well as frequent, closed and maximal item set. For a more extensive coverage, consult the survey [Borgelt 2012].


Items and Transactions

On an abstract level, the input to frequent item set mining consists of a bag or multiset of transactions that are defined over a set of items, sometimes called the item base. These items may be products, special equipment items, service options etc. For an abstract treatment it suffices that items have an identity, that is, it is possible to distinguish one item from another. The item base is the set of all considered items, for example, the set of all products that are sold by a given supermarket, mail-order company or online shop. Any subset of the item base is called an item set.

A transaction is simply an item set and it represents, for example, the set of products bought by a customer. Since two or more customers may, in principle, buy the exact same set of products, we cannot model the whole of all "shopping baskets" or "shopping carts" (bought, say, in a given week) as a set of transactions, since in a set each element is unique. There are several solutions to this problem: one may, for example, model the whole of all transactions as a bag or multiset (a generalization of a set, which allows for multiple occurrences of the same element) or as a vector (where elements at different positions may be the same, but are still distinguished by their position), or by extending each transaction with a unique transaction identifier (or tid for short; note that the position of a transaction in a vector representation is an implicit transaction identifier). Still another possibility consists in using a standard set of (unique) transactions and assigning to each of them an occurrence counter. Here I use the bag/multiset terminology, even though an occasional "set of transactions" may have slipped through. (This should always be read as "bag/multiset of transactions".)

Note that the item base (the set of all considered items) is often not given explicitly, but only implicitly as the union of all given transactions. This is also the case for my FP-growth program, which by default only takes a transaction file as input. However, it is also possible to specify the item base explicitly with an optional item appearances file (discussed in this section). This can be useful, for example, if one wants to restrict the analysis to a subset of all items. It can also be used to specify that certain items should only appear in the antecedents or only in the consequents of reported association rules.


Support of an Item Set

Let S be an item set and T the bag/multiset of all transactions under consideration. Then the absolute support (or simply the support) of the item set S is the number of transactions in T that contain S. Likewise, the relative support of S is the fraction (or percentage) of the transactions in T which contain S.

More formally, let S be an item set and U = { X∈T | S⊆X } the bag/multiset of all transactions in T that have S as a subset (i.e. contain all of the items in S and possibly some others). Then

suppabs(S) = |U| = |{ X∈T | S⊆X }|

is the absolute support of S and

supprel(S) = (|U| / |T|) *100%

is the relative support of S. Here |U| and |T| are the number of elements in U and T, respectively.

An important insight is that support (regardless of whether it is stated in an absolute or relative manner) is anti-monotone, that is

∀ I, J: (J ⊆ I) ⇒ (supp(J) ≥ supp(I)).

That is, if an item set is enlarged by adding one or more items, its support cannot increase.

In a supermarket setting, the item set S may be a set like S = { bread, wine, cheese } and T may be the bag/multiset of all "baskets" or "carts" of products bought by the customers of a supermarket – in a given week if you like. U is the bag/multiset of all transactions in T that contain all items in S (and maybe also some other items). For example, if a customer buys the set X = { milk, bread, apples, wine, sausages, cheese, onions, potatoes }, then S is obviously a subset of X, hence X is in U. If there are 318 customers, each giving rise to one transaction, and 242 of these customers bought such a set X or a similar one that contains S, while the other customers bought sets of products that lacked at least one of the items in S, then suppabs(S) = 242 and supprel(S) = 242/318 = 76.1%.
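
To make the computation concrete, here is a small, self-contained Python sketch (with made-up example data) that computes the absolute and the relative support of an item set over a bag of transactions:

transactions = [                     # bag/multiset of transactions (example data)
    {"milk", "bread", "cheese"},
    {"bread", "wine", "cheese"},
    {"wine", "beer"},
    {"bread", "wine", "cheese", "onions"},
]

def support(item_set, transactions):
    """Return (absolute, relative) support of item_set in the given transactions."""
    item_set = set(item_set)
    absolute = sum(1 for t in transactions if item_set <= t)   # S is a subset of t
    return absolute, absolute / len(transactions)

abs_supp, rel_supp = support({"bread", "wine", "cheese"}, transactions)
print(abs_supp, f"{100 * rel_supp:.1f}%")   # -> 2 50.0%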

The goal of frequent item set mining is to find all item sets (that is, all subsets of the item base) that occur in the given bag/multiset of transactions with at least a user-specified minimum support suppmin. Such item sets are called frequent item sets.

Since support is anti-monotone, it follows immediately that

∀ I, J: (supp(I) < suppmin ∧ J ⊇ I) ⇒ (supp(J) < suppmin).

This relationship is called the apriori property. It allows the search for frequent item sets to be pruned effectively: no superset of an infrequent item set can be frequent, and thus such supersets need not be visited in the search.
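
The effect of this pruning can be illustrated with a deliberately naive level-wise search in Python, which only ever extends item sets that are still frequent. This is merely a demonstration of the apriori property, not the FP-growth algorithm; all names are made up.

def frequent_item_sets(transactions, min_supp_abs):
    """Naive level-wise miner; it relies on the apriori property for pruning."""
    items = sorted({it for t in transactions for it in t})
    def supp(s):
        return sum(1 for t in transactions if s <= t)
    # start with the frequent single items
    current = [frozenset([it]) for it in items if supp(frozenset([it])) >= min_supp_abs]
    result = list(current)
    while current:
        # extend only frequent sets: supersets of infrequent sets cannot be frequent
        candidates = {s | frozenset([it]) for s in current for it in items if it not in s}
        current = [c for c in candidates if supp(c) >= min_supp_abs]
        result.extend(current)
    return result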

The default value for the minimum support in my FP-growth program is 10% (the percentage indicates implicitly that it refers to relative support). This value can be changed with the option -s. Note that the argument to this option is interpreted as a percentage if it is positive, but if it is negative, it is interpreted as an absolute number (number of transactions) rather than a percentage. That is, -s20 means a minimum relative support of 20%, while -s-20 means a minimum absolute support of 20 transactions.


Confidence of an Association Rule

If we search for association rules, we do not want just any association rules, but "good" association rules. To measure the quality of association rules, [Agrawal and Srikant 1994], the inventors of the Apriori algorithm, introduced the confidence of a rule. The confidence of an association rule R = "X → Y" (with item sets X and Y) is the support of the set of all items that appear in the rule (here: the support of S = X ∪ Y) divided by the support of the antecedent (also called "if-part" or "body") of the rule (here X). That is,

conf(R) = supp(X ∪ Y) / supp(X)

(Note that it does not matter whether the confidence is computed from the absolute or the relative support of an item set, as long as the same support type is used in both the numerator and the denominator of the fraction.)

More intuitively, the confidence of a rule is the number of cases in which the rule is correct relative to the number of cases in which it is applicable. For example, let R = "wine and bread → cheese". If a customer buys wine and bread (and maybe some other items), then the rule is applicable and it says that he/she can be expected to buy cheese. If he/she does not buy wine or does not buy bread or buys neither, then the rule is not applicable and thus (obviously) does not say anything about this customer.

If the rule is applicable, it says that the customer can be expected to buy cheese. But he/she may or may not buy cheese, that is, the rule may or may not be correct (for this customer). Naturally, we are interested in how good the prediction of the rule is, that is, how often its prediction that the customer buys cheese is correct. The rule confidence measures this: it states the percentage of cases in which the rule is correct. It states this percentage relative to the number of cases in which the antecedent holds, since these are the cases in which the rule makes a prediction that can be true or false. If the antecedent does not hold, then the rule does not make any prediction, so these cases are excluded.
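
As a small, self-contained Python illustration of this notion (with made-up shopping carts), the confidence of a rule X → Y can be computed directly from the two supports:

def confidence(antecedent, consequent, transactions):
    """conf(X -> Y) = supp(X u Y) / supp(X); returns None if the rule is never applicable."""
    X, Y = set(antecedent), set(consequent)
    supp_X  = sum(1 for t in transactions if X <= t)       # rule is applicable
    supp_XY = sum(1 for t in transactions if X | Y <= t)   # rule is applicable and correct
    return None if supp_X == 0 else supp_XY / supp_X

carts = [{"wine", "bread", "cheese"}, {"wine", "bread"}, {"bread", "milk"}]
print(confidence({"wine", "bread"}, {"cheese"}, carts))    # -> 0.5 (1 of 2 applicable cases)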

Rules are reported as association rules if their confidence reaches or exceeds a given lower limit (minimum confidence; to be specified by a user). That is, we look for rules that have a high probability of being true: we look for "good" rules, which make correct (or very often correct) predictions. My FP-growth program always uses a minimum confidence to select association rules. The default value for the minimum confidence is 80%. This value can be changed with the option -c. (Note that for the minimum confidence, the argument is always interpreted as a percentage. Negative values cause an error message, because there is no "absolute confidence".)

In addition to the rule confidence, my FP-growth program lets you select from several other (additional) rule evaluation measures, which are explained below; rule confidence is still used alongside them. If you want to rely entirely on some other measure, you can do so by setting the minimal rule confidence to zero. (Attention: If you have a large number of items, setting the minimal rule confidence to zero can result in very high memory consumption. Therefore: use this possibility with a lot of care, if at all.)


Support of an Association Rule

The support of association rules may cause some confusion, because I use this term in a different way than [Agrawal and Srikant 1994] do. For them, the support of an association rule "A and B → C" is the support of the set S = { A, B, C }. This may be fine if rule confidence is the only evaluation measure, but it causes problems if some other measure is used. For these other measures it is often much more appropriate to call the support of the antecedent of the association rule, that is, the support of X = { A, B } in the example above, the support of the association rule.

The difference can also be stated in the following way: for [Agrawal and Srikant 1994], the support of the rule is the (absolute or relative) number of cases in which the rule is correct (that is, in which the presence of the item C follows from the presence of the items A and B), whereas for me (and thus my FP-growth program) the support of a rule is the (absolute or relative) number of cases in which it is applicable (that is, in which the antecedent of the rule holds), although in some of these cases it may be false (because only the items A and B are present, but the item C is missing).

One reason for this choice, as already mentioned, is that the definition of [Agrawal and Srikant 1994] does not work well for evaluation measures other than rule confidence. This is explained in more detail below. Another reason is that I prefer the support of a rule to say something about the "statistical" support of a rule and its confidence, that is, from how many cases the confidence is computed in order to express how well founded the statement about the confidence is.

Maybe an example will make this clearer. Suppose you have a die which you suspect to be biased. To test this hypothesis, you throw the die, say, a thousand times. 307 times the 6 turns up. Hence you assume that the die is actually biased, since the relative frequency is about 30% although for an unbiased die it should be around 17%. Now, what is the "statistical" support of this statement, that is, on how many experiments does it rest? Obviously it rests on all 1000 experiments and not only on the 307 experiments in which the 6 turned up. This is so, simply because you had to do 1000 experiments to find out that the relative frequency is around 30%, and not only the 307 in which a 6 turned up (doing only these experiments is obviously impossible).

Or suppose you are doing an opinion poll to find out about the acceptance of a certain political party, maybe with the usual question "If an election were held next Sunday ...?" You ask 2000 persons, of which 857 say that they would vote for the party you are interested in. What is the support of the assertion that this party would get around 43% of all votes? It is the size of your sample, that is, all 2000 persons, and not only the 857 that answered in the positive. Again you had to ask all 2000 people to find out about the percentage of 43%. Of course, you could have asked fewer people, say, 100, of which, say, 43 said that they would vote for the party, but then your statement would be less reliable, because it is less "supported". The number of votes for the party could also be 40% or 50%, because of some random influences. Such deviations are much less likely, if you asked 2000 persons, since then the random influences can be expected to cancel out.

The rule support can be used to filter association rules by stating a lower bound for the support of a rule (minimum support). This is equivalent to saying that you are interested only in such rules that have a large enough statistical basis (since my FP-growth program uses the term "support" in my interpretation and not in the one used by [Agrawal and Srikant 1994]). The default value for this support limit is 10%. It can be changed with the option -s. Note that the argument, if positive, is interpreted as a percentage. If, however, the given argument is negative, it is interpreted as an absolute number (number of transactions) rather than a percentage.

The minimum support is combined with the minimum confidence to filter association rules. That is, my FP-growth program generates only association rules, the confidence of which is greater than or equal to the minimum confidence and the support of which is greater than or equal to the minimum support.

Despite the above arguments in favor of my definition of the support of an association rule, a rule support compatibility mode is available (due to the overwhelming pervasiveness of the original definition). With the option -o the original rule support definition can be selected. In this case the support of an association rule is the support of the set with all items in the antecedent (also called "if-part" or "body") and the consequent (also called "then-part" or "head") of the association rule, that is, the support of an association rule as defined in [Agrawal and Srikant 1994].
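
The difference between the two rule support definitions is easy to state in code. The following Python sketch returns both values for a rule X → Y (the function name is made up; with the option -o the program reports the second kind):

def rule_supports(X, Y, transactions):
    """Return (support as used by this program, support as in Agrawal & Srikant 1994)."""
    X, Y = set(X), set(Y)
    supp_body      = sum(1 for t in transactions if X <= t)        # rule is applicable
    supp_body_head = sum(1 for t in transactions if X | Y <= t)    # rule is applicable and correct
    return supp_body, supp_body_head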


Consequent of an Association Rule

Although association rules are generally defined as rules X → Y with two disjoint item sets X and Y (of which at most X may be empty, while |Y| ≥ 1), my FP-growth program only generates association rules with a single item in the consequent (that is, with |Y| = 1). I have been asked many times about this restriction (that is, why no rules with |Y| > 1 are generated), so I finally decided to write down the explanation as part of this documentation.

Let me first emphasize that allowing multiple items in the consequents of association rules generally leads to a (much) larger number of rules. There are usually already (many) more association rules than item sets if only a single item is allowed in the consequents (since each item set can give rise to multiple rules, although the filtering by confidence mitigates the situation and -- in extreme cases, for example, for a very high minimum confidence -- can reverse the relationship). This number becomes even larger if one allows for multiple items in the consequent. Multiple items in the consequents of association rules therefore come at a considerable cost, which must be counterbalanced by some added benefit in order to justify them.

Here the following considerations come into play: in the first place, if

x → y,z

is an association rule, then the two simpler rules

x → y    and    x → z

must also be association rules, because generally (regardless of how the support of an association rule is defined -- see here for the two possibilities)

supp(x → y) ≥ supp(x → y,z),     conf(x → y) ≥ conf(x → y,z),
supp(x → z) ≥ supp(x → y,z),     conf(x → z) ≥ conf(x → y,z).

That is, even if we restrict the output to rules with only a single item in the consequent, we necessarily obtain rules with the items y and z in the consequents, which provide (almost) the same information. If your goal is, say, to suggest other products to a customer who already placed the item x into his/her shopping cart, these two simpler rules may be perfectly sufficient to suggest both the items y and z. You do not need a rule x → y,z for that.

Of course, the rule x → y,z provides additional information that goes beyond the information of the two rules x → y and x → z: its confidence describes the conditional (condition x) joint occurrence of y and z, while the two rules x → y and x → z only describe the individual conditional occurrence of y and z. However, in what applications is this additional information actually an added benefit (if it is a benefit at all) that justifies to generate considerably more rules? My personal opinion is that the two simpler rules are sufficient for all practical purposes.

Secondly, if we define the support of an association rule as supp(X → Y) = supp(X ∪ Y) (see here for the two possibilities), we can even go one step further: if

x → y,z

is an association rule, then the two rules

x,y → z    and    x,z → y

must also be association rules, because they have the same support as the rule above and generally

conf(X → Y) ≥ conf(X \ {i} → Y ∪ {i}).

(That is, if in an association rule an item is moved from the antecedent to the consequent, the confidence of the rule cannot increase. This follows directly from the fact that support is anti-monotone and specifically that supp(X \ {i}) ≥ supp(X) for any X ≠ ∅.)

These association rules have only one item in the consequent and yield, though in a different form, information about the joint occurrence of y and z given the condition x. In a setting in which other products are to be suggested to a customer, they can be used in a chaining fashion. Given that a customer placed the item x into his/her shopping cart, we first use the rule x → y to infer that the customer is likely interested in y, then hypothesize that the customer will place item y into the shopping cart, and apply the rule x,y → z to the resulting situation to suggest z as well. In this way we achieve, without any loss of information or approximation, the same result that would be obtained from x → y,z.

Because of these arguments I think that there is no good reason to generate association rules with more than one item in the consequent: there is no true benefit, but a considerable cost from having to deal with a (often considerably) larger number of rules.


Target Types

An annoying problem in frequent item set mining is that the number of frequent item sets is often huge and thus the output can easily exceed the size of the transaction database to mine. In order to mitigate this problem, several restrictions of the set of frequent item sets have been suggested. These restrictions are covered by the target type.

The target type, which can be selected via the option -t, is either frequent item sets (default, option -ts), closed item sets (option -tc), maximal item sets (option -tm), generators (also called free item sets, option -tg) or association rules (option -tr).

More detailed information about the different target types of frequent item set mining can be found in the survey [Borgelt 2012].


Frequent Item Sets (default, option -ts)

Often one only wants to find frequent item sets. That is, one wants to find all item sets with a support exceeding a certain threshold, the so-called minimum support. For my FP-growth program this is the default operation mode. However, this mode can also be selected explicitly with the option -ts.


Closed Item Sets (option -tc)

A frequent item set is called closed if no superset has the same support (or, in other words, if all supersets have a lower support). Formally, an item set I is called closed iff

∀ J ⊃ I: supp(J) < supp(I).

If the option -tc is given, the found frequent item sets are subsequently filtered and only the closed item sets are reported.


Maximal Item Sets (option -tm)

A frequent item set is called maximal if no superset is frequent, that is, has a support reaching or exceeding the minimum support. Formally, an item set I is called maximal iff

∀ J ⊃ I: supp(J) < suppmin.

If the option -tm is given, the found frequent item sets are subsequently filtered and only the maximal item sets are reported.


Generators/Free Item Sets (option -tg)

A frequent item set is called a generator or free if no subset has the same support (or, in other words, if all subsets have a larger support). Formally, an item set I is called a generator iff

∀ J ⊂ I: supp(J) > supp(I)

If the option -tg is given, the found frequent item sets are subsequently filtered and only the generators / free item sets are reported.
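
To make the three filtered target types just described concrete, the following Python sketch filters an already computed collection of all frequent item sets according to the definitions above. It is only an illustration of the definitions, not of the conditional item set repositories actually used by the program; all names are made up.

def filter_target(frequent, target):
    """Filter a dict {frozenset: absolute support} that contains *all* frequent item sets."""
    result = {}
    for s, supp in frequent.items():
        if target == "closed":        # no proper superset has the same support
            keep = all(v < supp for t, v in frequent.items() if s < t)
        elif target == "maximal":     # no proper superset is frequent
            keep = not any(s < t for t in frequent)
        elif target == "generators":  # no proper subset has the same support
            keep = all(v > supp for t, v in frequent.items() if t < s)
        else:                         # plain frequent item sets
            keep = True
        if keep:
            result[s] = supp
    return result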


Association Rules (option -tr)

My FP-growth program generates association rules if the option -tr is given. Note, however, that it produces only association rules with a single item in the consequent (also called "then-part" or "head" of the rule). This restriction is due to the following considerations:

In the first place, association rule mining usually produces too many rules even if one confines oneself to rules with only one item in the consequent. So why should one make the situation worse by allowing more than one item in the consequent? (It merely blows up the size of the output.)

Secondly, I do not know any application in which rules with more than one item in the consequent are of any real use. The reason, in my opinion, is that such more complex rules add almost nothing to the insights about the data set. To understand this, consider the simpler rules that correspond to a rule with multiple items in the consequent, that is, rules having the same antecedent, but consequents with only single items from the consequent of the complex rule. All of these rules must necessarily be in the output, because neither their support (in my interpretation, see this section) nor their confidence can be less than that of the more complex rule. That is, if you have a rule A B → C D, you will necessarily also have the rules A B → C and A B → D in the output. Of course, these latter two rules together do not say the same as the more complex rule; they do contain additional information. However, what do you gain from the additional information the more complex rule gives you? How can you use it? And is this little extra information worth having to cope with a much bigger rule set? In my opinion, the answer is a clear no.


Program Invocation

My FP-growth implementation is a command line program that has to be called in a terminal or command window or from a shell script. If double-clicked in a file manager (e.g. in Microsoft Windows), it merely prints a usage message to a terminal/command window that is briefly opened and immediately closed again. This does not mean that the program does not work. In order to run the program, open a terminal/command window, change directory to where you stored the program, and then type an invocation command.

The general syntax of the program invocation is

fpgrowth [options] infile [outfile]

The first argument infile, which is mandatory, is the name of a file that contains the transactions to analyze. The format of this file is described in this section. If instead of a file name a single minus sign "-" or an empty string "" is given, the input is read from standard input rather than from a file.

The second argument outfile, which is optional (as indicated by the brackets), is the name of a file to which the found frequent item sets are to be written. That it is optional is useful for benchmark tests, where the time it takes to write the output to a file can conceal the actual search time, or if only a pattern spectrum (number of found frequent item sets collected by size and (absolute) support; option -P) is to be found. The format in which frequent item sets are written to the output file is described in this section. If instead of a file name a single minus sign "-" or an empty string "" is given, the output is written to standard output rather than to a file.

In addition to the input and output file several options can be given, all of which consist of a minus sign and a single letter. The full list of options can be found in the next section.

Some options take a parameter. For example, the option -s, with which the minimum support is specified, takes a number as a parameter, which must follow the letter s without any separating space. A space between the option character and its parameter is only allowed if the parameter is a string, for example, a file name. (However, even in this case the parameter may follow the option letter directly.) If the parameter is a number or a single letter, it must follow the option letter directly.

Options may be combined. For example,

fpgrowth -s10m2n5 input output

means that the minimum support is 10% (option -s), the minimum number of items in an item set is 2 (option -m) and the maximum number of items in an item set is 5 (option -n).

Options may be specified anywhere on the command line, that is, before the input file name, in between the input and output file names, or after the output file name.

If an option is given more than once, the last statement counts. That is, with

fpgrowth -s10 -s20 input output

the minimum support is 20%, as the -s10 is overwritten by the following -s20.


Program Options

My FP-growth implementation supports the following options
(a # after the option letter means that the option takes a parameter):

option    meaning    default
-t# target type
s: frequent item sets
c: closed (frequent) item sets
m: maximal (frequent) item sets
g: (frequent) generators
r: association rules
s
-m# minimum number of items per item set 1
-n# maximum number of items per item set no limit
-s# minimum support of an item set
positive: percentage of transactions
negative: absolute number of transactions
10
-S# maximum support of an item set
positive: percentage of transactions
negative: absolute number of transactions
100
-o use original definition of the support of a rule (body & head)
-c# minimum confidence of a rule as a percentage 80
-e# additional evaluation measure
frequent item sets:
x: no measure
b: binary logarithm of support quotient (+)
association rules:
x: no measure
o: rule support (original def.: body & head) (+)
c: rule confidence (+)
d: absolute confidence difference to prior (+)
l: lift value (confidence divided by prior) (+)
a: absolute difference of lift value to 1 (+)
q: difference of lift quotient to 1 (+)
v: conviction (inverse lift for negated head) (+)
e: absolute difference of conviction to 1 (+)
r: difference of conviction quotient to 1 (+)
z: certainty factor (relative confidence change) (+)
n: normalized χ2 measure (+)
p: p-value from (unnormalized) χ2 measure (-)
y: normalized χ2 measure with Yates' correction (+)
t: p-value from Yates-corrected χ2 measure (-)
i: information difference to prior (+)
g: p-value from G statistic/information difference (-)
f: Fisher's exact test (table probability) (-)
h: Fisher's exact test (χ2 measure) (-)
m: Fisher's exact test (information gain) (-)
s: Fisher's exact test (support) (-)
All measures for association rules are also applicable
to item sets and are then aggregated over all possible
association rules with a single item in the consequent.
The aggregation mode can be set with the option -a#.
Measures marked with (+) must meet or exceed the threshold,
measures marked with (-) must not exceed the threshold
in order for the rule or item set to be reported.
x
-a# aggregation mode for evaluation measure
x: no aggregation (use first value)
m: minimum of individual measure values
n: maximum of individual measure values
a: average of individual measure values
s: split item set into equal size subsets
x
-d# threshold for additional evaluation measure
(as a percentage)
10
-i invalidate evaluation below expected support evaluate all
-p# (minimum size for) pruning with evaluation
< 0: weak forward
> 0: strong forward
= 0: backward pruning
no pruning
-q# sort items w.r.t. their frequency
0: do not sort
1: ascending, -1: descending w.r.t. item frequency
2: ascending, -2: descending w.r.t. transaction size sum
2
-A# variant of the FP-growth algorithm to use
s: simple tree nodes with only successor and parent
c: complex tree nodes with children and siblings
d: top-down processing on a single prefix tree
t: top-down processing of the prefix trees
Variant d does not support mining closed/maximal item sets,
variant t does not support the use of a k-items machine, and
only variant c supports item reordering w.r.t. conditional support,
but closed/maximal item sets can only be mined without reordering.
c
-x do not prune with perfect extensions prune
-l# number of items for k-items machine
(only for algorithm variants i,r,o)
16
-j do not sort items w.r.t. cond. support
(only for algorithm variant c, option -Ac)
-u do not use head union tail (hut) pruning
(only for maximal item sets, not with algorithm variant b)
use hut
-F#:#.. support border for filtering item sets
(list of minimum support values, one per item set size,
starting at the minimum size, as given with option -m#)
none
-R# read an item selection from a file
parameter: file name
-P# write a pattern spectrum to a file
parameter: file name
-Z print item set statistics
(number of item sets per size)
-N do not pre-format some integer numbers
-g write output in scannable form
(quote certain characters)
-h# record header for output ""
-k# item separator for output " "
-I# implication sign for association rules " "
-v# output format for item set information
(changed to " (%a)" if the parameter of -s is negative)
" (%S)"
-w transaction weight in last field only items
-r# record/transaction separators "\n"
-f# field/item separators " \t,"
-b# blank characters " \t\r"
-C# comment characters "#"
-! print additional option information

Input Format

Format of the Transactions File

The transactions file has to be a (text) file structured by field and record separators and blanks. Record separators, not surprisingly, separate records, usually lines (since the default record separator is the newline character); field separators separate fields, usually words (since among the default field separators are the space and the tabulator, but also the comma). Blanks are used to fill fields, for example, to align them. In the transactions file each record must contain one transaction, i.e. a list of item identifiers, which are separated by field separators. An empty record is interpreted as an empty transaction. That is, it is counted for the total number of transactions, but does not count towards the support of any item set other than the empty set. A transaction may contain the same item multiple times (that is, duplicate items do not raise an error), but this multiplicity is disregarded by the FP-growth program. It only considers whether an item is contained in a transaction or not, not how many times an item is contained in a transaction.
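
To make the role of record separators, field separators and blanks concrete, here is a minimal Python reader that mimics the conventions just described (newline as record separator, space/tab/comma as field separators, duplicate items collapsed). It is only an illustration of the format, not the program's actual parser; all names are made up.

import re

def read_transactions(path, field_seps=" \t,", blanks=" \t\r"):
    """Read one transaction per line; duplicate items within a transaction are collapsed."""
    split_re = re.compile("[" + re.escape(field_seps) + "]+")
    transactions = []
    with open(path, encoding="utf-8") as f:
        for line in f:                        # default record separator: the newline character
            line = line.strip(blanks + "\n")
            items = [it for it in split_re.split(line) if it]
            transactions.append(set(items))   # an empty line yields an empty transaction
    return transactions

# transactions = read_transactions("test1.tab")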

Example input files can be found in the directory fpgrowth/ex in the source package. These files are used in the following to demonstrate how to use the command line options -r, -f, -b and -w. In addition, there are several conversion scripts (for Linux/Unix), with which different common input formats can be converted into the format required by the FP-growth program.

In the file test1.tab transactions are separated by newline characters (that is, each line of the file contains one transaction) and the items of a transaction are separated by spaces. That is, the file test1.tab looks like this:

a b c
a d e
b c d
a b c d
b c
a b d
d e
a b c d
c d e
a b c

As can be seen, there are ten transactions over the item base {a, b, c, d, e}, which contain between two and four items each.

The file test1.tab is in the standard input format and hence it can be processed directly:

fpgrowth test1.tab test1.out

Instead of spaces, tabulators may be used, and it is possible to mix spaces and tabulators. Note, however, that multiple consecutive white space characters (like multiple spaces or a space and a tabulator etc.) are interpreted as a single field separator. The reason is that by default spaces, tabulators and the carriage return character are interpreted as blank characters, which are removed when reading items. Hence only the first white space character after an item is interpreted as a field separator, all following white space characters are interpreted as blanks.

Note also that commas are among the default field separators as well. That is, if the file test1.tab looked like this:

a,b,c
a,d,e
b,c,d
a,b,c,d
b,c
a,b,d
d,e
a,b,c,d
c,d,e
a,b,c

it could still be processed directly with the command stated above. You may also mix spaces, tabulators and commas.

Unfortunately, though, the fact that commas are interpreted as field separators does not necessarily mean that CSV-files (where CSV stands for "comma separated values"), as they can be written by programs like Microsoft Excel, can be processed directly. The reason is that in a CSV-file all lines contain the same number of fields. That is, in CSV-format, the above input file would look like this (file test1.csv in directory fpgrowth/ex):

a,b,c,
a,d,e,
b,c,d,
a,b,c,d
b,c,,
a,b,d,
d,e,,
a,b,c,d
c,d,e,
a,b,c,

Note the single and double commas at the end of most lines, which separate empty fields (as these transactions have fewer items than the longest transaction). While a single comma at the end of a line does not cause any problems and is simply ignored, two or more commas lead to an error message "item expected", because an item may not be an empty string. This can be fixed by declaring the comma a blank character (option -b). That is, the CSV-file can be processed with:

fpgrowth -b, test1.csv test1.out

Note, however, that the characters given with the option -b replace the default blank characters. So if you still need spaces to be interpreted as blanks, they have to be specified as well:

fpgrowth -b" ," test1.csv test1.out

In the file test2.tab the same transactions can be found, but several different field separators are used:

a,b,c
a,d,e
b.c.d
a,b,c,d
b:c
a,b,d
d,e
a,b,c,d
c;d;e
a,b,c

The file test2.tab can be processed by declaring different field separators with the option -f:

fpgrowth -f",.;:" -l test2.tab test2.out

The file test3.tab has basically the same format as the file test1.tab, with the only difference that the last field of each record states an (integer) transaction weight/multiplicity.

a b c 2
a d e 1
b c d 1
a b c d 2
b c 1
a b d 1
d e 1
c d e 1

This allows us to combine transactions, so that test3.tab has only 8 lines, while test1.tab has 10 lines, because the transactions a b c and a b c d occur twice. In order to instruct the FP-growth program to interpret the last field of each record as such a weight/multiplicity, it has to be invoked with the option -w:

fpgrowth -w test3.tab test3.out

The files test4.tab to test6.tab are in formats that may be common, but which cannot be processed directly with the FP-growth program.

In the file test4.tab each line contains a transaction identifier and an item, separated by a space (not shown because of the large number of lines). This file can be converted (on Linux/Unix systems) into the standard input format with the script tid2set, i.e., with

tid2set test4.tab x.tab

Note that this script sorts the input file (here: test4.tab) w.r.t. the transaction identifier, so that items belonging to the same transaction occupy consecutive lines/records. That is, the input need not already be sorted by transaction identifier, rather the script does this to make sure that the conversion works.
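
If the shell scripts are not available, the same grouping of (transaction identifier, item) pairs can be sketched in a few lines of Python; the function name is made up and the file names are just the examples from above.

from collections import defaultdict

def tid_pairs_to_transactions(infile, outfile):
    """Group (transaction id, item) pairs into one space-separated transaction per line."""
    groups = defaultdict(list)
    with open(infile, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                tid, item = parts[0], parts[1]
                groups[tid].append(item)
    with open(outfile, "w", encoding="utf-8") as f:
        for tid in sorted(groups):            # sort by transaction identifier, like tid2set
            f.write(" ".join(groups[tid]) + "\n")

# tid_pairs_to_transactions("test4.tab", "x.tab")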

In the file test5.tab the first line states the item names and the following lines contain flags T (true) and F (false) depending on whether the item is contained in the transaction represented by the line or not:

a b c d e
T T T F F
T F F T T
F T T T F
T T T T F
F T T F F
T T F T F
F F F T T
T T T T F
F F T T T
T T T F F

This format can be converted (on Linux/Unix systems) into the standard input format with the script flg2set, i.e., with

flg2set test5.tab x.tab

This script interprets T, t, and 1 as "true", that is, as indicators that the item corresponding to the column is contained in the transaction corresponding to the row, and any other entry as "false" (the item is not contained in the transaction).
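
A Python equivalent of this conversion could look as follows (only a sketch that mirrors the described behavior of flg2set; the function name is made up):

def flags_to_transactions(infile, outfile):
    """Convert a header line of item names plus T/F rows into one transaction per line."""
    with open(infile, encoding="utf-8") as f:
        names = f.readline().split()                  # first line: the item names
        rows = [line.split() for line in f if line.strip()]
    with open(outfile, "w", encoding="utf-8") as f:
        for row in rows:
            items = [n for n, flag in zip(names, row) if flag in ("T", "t", "1")]
            f.write(" ".join(items) + "\n")

# flags_to_transactions("test5.tab", "x.tab")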

In the file test6.tab there is one item per line and transactions are separated by blank lines (not shown here because of the large number of lines). This format can be converted (on Linux/Unix systems) into the standard input format with the script row2set, i.e., with:

row2set test6.tab x.tab

Note that the file test6.tab could be processed directly if the transactions were not separated by a mere empty line (that is, two newline characters following each other), but if the empty line contained a special character, for example %. In this case the file can be processed directly with

fpgrowth -r"%" -f"\n" -b"\n" test6.tab x.tab

The additional scripts tab2set and hdr2set convert tables with column numbers or column names into a format appropriate for the FP-growth program. They are invoked in the same way as all other scripts discussed above, i.e., with

tab2set a.tab b.tab

or

hdr2set a.tab b.tab

where a.tab is the name of the input file and b.tab the name of the output file. The script tab2set replaces each table entry "x" of the input file by "Xi=x", where i is the column number (starting with 1).

The script hdr2set reads the variable names from the first line of the input file and then replaces each table entry "x" by "X=x", where "X" is the variable name that was found in the corresponding column of the first line. These scripts are handy if you want to process tabular data by treating each table row as a transaction.

Note that any input may also be read from standard input and any output may be sent to standard output, simply by specifying a '-' or an empty string "" instead of a filename. For example (on a Linux/Unix system)

cat test1.tab | fpgrowth - -

reads the transactions from standard input (where they are fed by the cat command) and writes the found frequent item sets directly to the terminal. They may be piped to any other program or script, since all other messages of the FP-growth program are written to standard error.


Format of the Item Appearances File

In addition to an input file with the transactions and an output file for the found item sets, an item selection/appearances file may be passed to the FP-growth program with the option -R.

If the target is some kind of frequent item set, this file lists a selection of items that are to be included in the search, while any items not listed in this file are to be ignored. It can be seen as a specification of the item base for the search.

The item selection file has to be a (text) file structured by the same field and record separators and blanks as the transactions file. Items may be listed in a single record or in multiple records; it does not matter in which record an item occurs. All that matters is whether an item occurs in the item selection file (then it is included in the search) or not (then it is ignored).

If the target is association rules, the file specified with the option -R is used not only to specify which items are to be included in the search, but also in which part of reported association rules (antecedent and/or consequent) an item may appear.

The first record, which must have one field, contains the default appearance to be used with all items not mentioned in the appearances file. The following records state the appearance of specific items, one per record. The first field of such a record states the item, the second the appearance indicator. If no appearance indicator is given, the item will be ignored (i.e. it may appear neither in the antecedent (body) nor in the consequent (head) of a rule). Empty records are ignored.

The following appearance indicators are recognized:

Example 1: Generate only rules with item "x" in the consequent.

in
x out

Example 2: Item "x" may appear only in rule consequents (head), item "y" only in rule antecedents (body); appearance of all other items is not restricted.

both
x head
y body

Providing no item appearances file is equivalent to an item appearances file containing only an indicator like "both", which does not restrict the appearance of any items.


Output Format

The output format for item sets and association rules is fairly flexible. Especially the output of the additional information about item sets and association rules (support, confidence etc.) can be formatted freely.


Format of Frequent Item Sets

Each line of the output file contains one item set in the format

A B C [...] <add. info>

where A, B and C are item identifiers, and <add. info> is determined by the additional information output format (option -v). The item separator may be changed with the option -k. For example,

fpgrowth -k, test1.tab test.out

produces output in the format

A,B,C,[...] <add. info>

Each output line may be preceded by a string that is specified with the option -h (record header for output).

The output format for the additional rule information can be any string with the following special format symbols (similar to the style of the special format symbols of the printf function in the programming language C):

%% a percent sign
%i number of items (item set size)
%a absolute item set support
%s relative item set support as a fraction
%S relative item set support as a percentage
%e additional evaluation measure
%E additional evaluation measure as a percentage

All format specifiers can be extended by an integer number between the percent sign and the defining character, which specifies the number of decimal digits to be printed (except for those specifiers that refer to integer numbers, like the number of items and the absolute support, for which such a digit count is ignored). If no such number is given, (up to) six significant digits are reported.

The default additional information output format for item sets is " (%S)". That is, the relative support of the item set is printed with (up to) six significant digits, enclosed in parentheses. Two spaces separate the additional information from the last item in the set. The default format is automatically changed to " (%a)" if the minimum support (option -s) is given as a negative number (absolute support), so that the output is consistent with the support threshold.
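
To illustrate how such a format string is expanded, the following Python sketch fills the item set specifiers listed above for one item set. It ignores the optional digit count and uses a made-up parameter eval_val for the %e/%E evaluation value; it is not the program's own formatting code.

def format_info(fmt, n_items, supp_abs, n_trans, eval_val=0.0):
    """Expand the specifiers %%, %i, %a, %s, %S, %e, %E (optional digit counts are ignored)."""
    subst = {"%": "%",
             "i": str(n_items),
             "a": str(supp_abs),
             "s": f"{supp_abs / n_trans:.6g}",          # relative support as a fraction
             "S": f"{100.0 * supp_abs / n_trans:.6g}",  # relative support as a percentage
             "e": f"{eval_val:.6g}",
             "E": f"{100.0 * eval_val:.6g}"}
    out, i = "", 0
    while i < len(fmt):
        if fmt[i] == "%" and i + 1 < len(fmt) and fmt[i + 1] in subst:
            out += subst[fmt[i + 1]]
            i += 2
        else:
            out += fmt[i]
            i += 1
    return out

print("a b c" + format_info(" (%S)", 3, 3, 10))   # -> a b c (30)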


Format of Association Rules

Each line of the output file describes one association rule in the format

C <- A B [...] <add. info>

where A, B and C are item identifiers, and <add. info> is determined by the additional information output format (option -v). The implication sign can be changed with the option -I. The item separator can be changed with the option -k. For example,

fpgrowth -tr -k, test1.tab test.out

produces output in the format

C <- A,B, [...] <add. info>

Each output line may be preceded by a string that is specified with the option -h (record header for output).

The output format for the additional rule information can be any string with the following special format symbols (similar to the style of the special format symbols of the printf function in the programming language C):

%% a percent sign
%a absolute item set support
%s relative item set support as a fraction
%S relative item set support as a percentage
%b absolute body set support
%x relative body set support as a fraction
%X relative body set support as a percentage
%h absolute head item support
%y relative head item support as a fraction
%Y relative head item support as a percentage
%c rule confidence as a fraction
%C rule confidence as a percentage
%l lift value of a rule (confidence/prior)
%L lift value of a rule as a percentage
%e additional evaluation measure
%E additional evaluation measure as a percentage

All format specifiers can be extended by an integer number between the percent sign and the defining character, which specifies the number of decimal digits to be printed (except for those specifiers that refer to integer numbers, like the number of items and the absolute support, for which such a digit count is ignored). If no such number is given, (up to) six significant digits are reported.

The default additional information output format for rules is " (%X, %C)". That is, the relative support of the rule antecedent and the confidence of the rule are printed with (up to) six significant digits. These two values are separated by a comma and enclosed in parentheses. Two spaces separate the additional information from the last item in the rule.


Format of a Pattern Spectrum

A pattern spectrum collects the number of found frequent item sets sub-divided by the size and the support of the item sets. Pattern spectra (together with surrogate data/randomization methods) can be useful to determine the statistical significance of found frequent item sets. Details can be found in the papers [Picado-Muino et al. 2013] and [Torre et al. 2013].

My FP-growth implementation can be instructed to collect and write a pattern spectrum with the option -P, which takes the name of the file to which the patterns spectrum is to be written as a parameter.

A pattern spectrum is written as a list of (size, support, count) triplets. The output file is a simple text file that contains one triplet per line, with the three numbers separated by spaces. For example, for the input file test1.tab (see this section) the pattern spectrum is (with default option settings):

1 3 1
1 6 1
1 7 3
2 1 2
2 3 1
2 4 4
2 5 1
2 6 1
3 1 2
3 2 1
3 3 2
3 4 1
4 2 1

The first column contains the different sizes of item sets, the second column the different support values, and the third column the number of item sets found for the corresponding (size, support) combination. For example, the sixth row indicates that there are four frequent item sets with two items and support 4. Note that in a pattern spectrum the support is always given as absolute support, that is, as a number of transactions.
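
Reading such a pattern spectrum back into another program is straightforward; the following Python sketch (with a hypothetical file name) parses the triplets into a dictionary:

def read_pattern_spectrum(path):
    """Read (size, support, count) triplets into a dict {(size, support): count}."""
    spectrum = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) == 3:
                size, supp, count = map(int, fields)
                spectrum[(size, supp)] = count
    return spectrum

# spectrum = read_pattern_spectrum("test1.psp")   # hypothetical file name
# spectrum[(2, 4)] -> 4 for the example above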


Extended Rule Selection

If association rules are selected using a minimum confidence, the following problem arises: "Good" rules (rules that are often true) are not always "interesting" rules (rules that reveal something about the interdependence of the items). You certainly know the examples that are usually given to illustrate this fact. For instance, it is easy to discover in a medical database that the rule "pregnant → female" is true with a confidence of 100%. Hence it is a perfect rule that never fails, but, of course, this is not very surprising. Although the measures explained below cannot deal with this problem (which is a semantic one), they may be able to improve the results in a related case.

Let us look at the supermarket example again and let us assume that 60% of all customers buy some kind of bread. Consider the rule "cheese → bread", which holds with a confidence of, say, 62%. Is this an important rule? Obviously not, since the fact that the customer buys cheese does not have a significant influence on him/her buying bread: the percentages are almost the same. But if you had chosen a confidence limit of 60%, you would get both rules "∅ → bread" (confidence 60%) and "cheese → bread" (confidence 62%), although the first would suffice (the first, because it is the simpler of the two). The idea of all measures that can be used in addition to or instead of rule confidence is to handle such situations and to suppress the second rule.

In addition, consider the following case: Assume that the confidence of the rule "cheese → bread" is not 62% but 35%. With a confidence limit of 60% it would not be selected, but it may be very important to know about this rule! Together with cheese, bread is bought much less frequently than it is bought at all. Is cheese some kind of substitute for bread, so that one does not need any bread if one has cheese? Ok, maybe this is not a very good example. However, what can be seen is that a rule with low confidence can be very interesting, since it may capture an important influence. Furthermore, this is a way to express negation (although only in the consequent of a rule), since "cheese → bread" with confidence 35% is obviously equivalent to "cheese → no bread" with confidence 65%. This also makes clear why the support of the item set that contains all items in the antecedent ("if-part", "body") and the consequent ("then-part", "head") of the rule is not appropriate for this measure. An important rule may have confidence 0 and thus a support (in the interpretation of [Agrawal and Srikant 1994]) of 0. Hence it is not reasonable to set a lower bound for this kind of support.

I hope that the intention underlying all of these explanations is clear by now: Potentially interesting rules differ significantly in their confidence from the confidence of rules with the same consequent, but a simpler antecedent. Adding an item to the antecedent is informative only if it significantly changes the confidence of the rule. Otherwise the simpler rule suffices.

Unfortunately the measures other than rule confidence do not solve the rule selection problem in the very general form in which it was stated above. It is not that easy to deal with all rules that have a simpler antecedent, to keep track of which of these rules were selected (this obviously influences the selection of more complicated rules), to deal with the special type of Poincare paradox that can occur etc. Hence the measures always compare the confidence of a rule with the confidence of the rule with an empty antecedent, that is, with the relative frequency of the consequent.

I call the confidence of an association rule with an empty antecedent the prior confidence (of any rule with the same consequent), since it is the confidence that the item in the consequent of the rule will be present in a transaction prior to any information about other items that may be present. (Note that the prior confidence of a rule is obviously equal to the (relative) support of its consequent item.) The confidence of an association rule with a non-empty antecedent (and the same consequent) I call the posterior confidence (or simply the confidence) of the rule, since it is the confidence that the item in the consequent of the rule will be present after it becomes known that the items in the antecedent of the rule are present.

Most measures that can be computed with my FP-growth program and used for filtering in addition to the rule confidence are basically computed from these two values: the prior confidence and the posterior confidence. Only those association rules are reported for which the value of the chosen additional evaluation measure is in a certain relation to a certain limit (that is, it either (1) meets or exceeds the limit or (2) does not exceed the limit – which relation applies depends on the chosen measure). The measures are chosen with the option -e, the limit is passed to the program via the option -d. The default value for the limit is 10%. Note that the option -d always interprets its argument as a percentage. As a consequence, the desired limit value may sometimes have to be multiplied by 100 in order to obtain the needed argument value.

Note that all additional rule evaluation measures are combined with the limits for rule confidence and rule support. That is, my FP-growth program reports only those rules, the confidence of which is greater than or equal to the minimum confidence, the support of which is greater than or equal to the minimum support, and for which the additional evaluation value (if selected) is in the measure-specific relation to the limit for this measure. The default is to use no additional evaluation measure (but this may also be specified explicitly with the option -ex), that is, to rely only on rule confidence and rule support. Of course you can remove the restriction that the rule support and the rule confidence must meet or exceed certain limits by simply setting either (or both) of these limits to zero. In this case rules are selected using only the limit for the additional evaluation measure. (Attention: If you have a large number of items, setting the minimum rule support or the minimum rule confidence to zero can result in very high memory consumption. Therefore: use this possibility with a lot of care, if at all.)


Absolute Confidence Difference to Prior (option -ed)

The simplest way to compare the posterior and the prior confidence of an association rule (since the former should differ considerably from the latter to make the rule interesting) is to compute the absolute value of their difference. That is, if "∅→ bread" has a confidence of 60% and "cheese → bread" has a confidence of 62%, then the value of this measure is 2% (for the second rule). The parameter given with the option -d to the program states a lower bound for this difference (in percentage points). It follows that this measure selects rules, the (posterior) confidence of which differs more than a given threshold from the corresponding prior confidence.

For example, with the option -d20 (and, of course, the option -ed to select this measure) only rules with a confidence less than 40% or greater than 80% would be selected for the item "bread" in the consequent. As a consequence, the selected rules are those, for which the antecedent considerably changes the confidence. (Note that, of course, for other items, with a different prior confidence, the upper and lower bounds are different. For example, they are 10% and 50% for a rule with a prior confidence of 30%.)


Lift Value (Confidence Quotient, option -el)

The so-called lift value is simply the quotient of the posterior and the prior confidence of an association rule. That is, if "∅→ bread" has a confidence of 60% and "cheese → bread" has a confidence of 72%, then the lift value (of the second rule) is 72/60 = 1.2. Obviously, if the posterior confidence equals the prior confidence, the value of this measure is 1. If the posterior confidence is greater than the prior confidence, the lift value exceeds 1 (the presence of the antecedent items raises the confidence), and if the posterior confidence is less than the prior confidence, the lift value is less than 1 (the presence of the antecedent items lowers the confidence).

More formally, the lift of the rule R = X → Y is

lift(R) = conf(X → Y) / conf(∅ → Y) = (supp(X ∪ Y) / supp(X)) / (supp(Y) / supp(∅))

where supp(∅) = |T|, the size of the transaction database (number of transactions).

The value that can be passed to the program with the option -d is a lower limit for this measure: only rules for which this measure meets or exceeds the given value are reported. As a consequence, the selected rules are those that raise the confidence by at least a given minimum factor. Note, however, that the option -d always interprets its arguments as a percentage. Therefore, in order to filter rules with a lift value of at least 1.2, the option -d120 must be provided (that is, the argument must be the limit for the lift value times 100%).

back to the top top

Absolute Difference of Lift Value to 1 (option -ea)

With the lift value (see preceding section) it is not possible to filter association rules for which the lift value differs more than a given amount from 1, because the limit passed to the program with the option -d is a lower limit. Hence only rules with a lift value that exceeds 1 by a given minimum amount (and thus, for which the presence of the antecedent items raises the confidence by at least a given factor) can be properly filtered.

In order to be able to filter rules for which the lift value differs by more than a given amount from 1, the absolute difference of the lift value to 1 can be used (option -ea). For example, if the prior confidence of an association rule (the support of its consequent) is 60% and the option -d20 is given, rules with a confidence less than or equal to (1 – 20%) * 60% = 0.8 * 60% = 48% or a confidence greater than or equal to (1 + 20%) * 60% = 1.2 * 60% = 72% are selected. Similarly, if the prior confidence is 30%, the numbers are 0.8 * 30% = 24% and 1.2 * 30% = 36%.

As these examples show, the main difference between this measure and the absolute difference of the posterior and prior confidences is that the deviation that is considered significant depends on the prior confidence (12 percentage points of deviation for a 60% prior confidence, but only 6 percentage points for a 30% prior confidence). The idea is that for a high prior confidence the posterior confidence must deviate strongly to be considered interesting, whereas for a low prior confidence a smaller deviation suffices.

(I am grateful to Roland Jonscher, S-Rating GmbH, Berlin, Germany, who pointed out this measure to me.)
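
A minimal sketch of this measure (illustrative Python, not the program's code), reusing the lift computation from above:

    # Absolute difference of the lift value to 1.
    def lift_diff(supp_xy, supp_x, supp_y, n):
        lift = (supp_xy / supp_x) / (supp_y / n)
        return abs(lift - 1.0)

    # Prior confidence 60%, posterior 72%: lift 1.2, measure 0.2,
    # i.e. exactly on the limit set by -d20.
    print(lift_diff(supp_xy=72, supp_x=100, supp_y=600, n=1000))  # 0.2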

back to the top top

Difference of Lift Quotient to 1 (option -eq)

Instead of simply forming the absolute difference of the lift value of an association rule to 1, one may also compute the difference to 1 of either the lift value or its reciprocal, whichever is smaller. Since the smaller of the lift value and its reciprocal is necessarily non-negative and at most 1, this yields a value between 0 and 1. This measure can be selected with the option -eq; a lower limit can, as usual, be specified with the option -d.

For example, if the prior confidence of an association rule (the support of its consequent) is 60% and the option -d20 is given, association rules with a confidence less than or equal to (1 – 20%) * 60% = 0.8 * 60% = 48% or a confidence greater than or equal to 60% / (1 – 20%) = 60% / 0.8 = 75% are selected. On the other hand, if the prior confidence is only 30%, rules with a (posterior) confidence of no more than 0.8 * 30% = 24% or at least 30% / 0.8 = 37.5% are selected (with -d20).

Note that the main difference to the absolute difference of the lift value to one (see the preceding section) is that positive and negative deviations are treated differently. With the absolute difference of the lift value to one, positive and negative deviations (posterior confidence exceeding and falling short of the prior confidence, respectively) are treated the same. The deviation must be by the same amount (in percentage points) in order to select the rule. With this measure (difference of lift quotient to 1), however, a negative deviation by a certain number of percentage points can lead to the selection of the rule, while the same positive deviation (in percentage points) need not lead to a selection of the rule. For the example above, with a prior confidence of 60%, the positive deviation limit is 15%, the negative deviation limit only 12%. Likewise, for a prior confidence of 30%, the positive deviation limit is 7.5%, while the negative deviation limit is only 6%.
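
A minimal sketch of this measure (illustrative Python, not the program's code):

    # Difference of the lift quotient to 1, that is, 1 - min(lift, 1/lift).
    def lift_quot_diff(supp_xy, supp_x, supp_y, n):
        lift = (supp_xy / supp_x) / (supp_y / n)
        return 1.0 - min(lift, 1.0 / lift)

    # Prior confidence 60%: a posterior of 75% gives lift 1.25 and a value of
    # 1 - 1/1.25 = 0.2, while a posterior of 48% gives lift 0.8 and 1 - 0.8 = 0.2.
    print(lift_quot_diff(supp_xy=75, supp_x=100, supp_y=600, n=1000))  # 0.2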

back to the top top

Conviction (Inverse Lift for Negated Head, option -ev)

The conviction of an association rule R = X → Y is the inverse lift of the rule R' = X → not Y. That is, while the lift of the rule R is

lift(R) = conf(X → Y) / conf(∅ → Y)

the conviction of the rule R is

cvct(R) = (1 – conf(∅ → Y)) / (1 – conf(X → Y)) = conf(∅ → not Y) / conf(X → not Y)

where supp(∅) = |T|, the size of the transaction database.

Intuitively, the conviction states by what factor the correctness of the rule (as expressed by its confidence) would reduce if the antecedent and the consequent of the rule were independent. A high value therefore means that the consequent depends strongly on the antecedent. Consequently, the value passed with the option -d is interpreted as a lower limit. Note, however, that the option -d always interprets its arguments as a percentage (cf. the discussion of the lift value). Therefore, in order to filter rules with a conviction value of at least 1.2, the option -d120 must be provided (that is, the argument must be the limit for the conviction value times 100%).
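
A minimal sketch of the conviction computation (illustrative Python, not the program's code; the case of a 100% confident rule, which would lead to a division by zero, is ignored here):

    # Conviction as the inverse lift for the negated consequent.
    def conviction(supp_xy, supp_x, supp_y, n):
        posterior = supp_xy / supp_x        # conf(X -> Y)
        prior     = supp_y  / n             # conf(empty -> Y)
        return (1.0 - prior) / (1.0 - posterior)

    # Prior 60%, posterior 75%: conviction (1 - 0.6) / (1 - 0.75) = 1.6;
    # to require a conviction of at least 1.2, pass -d120.
    print(conviction(supp_xy=75, supp_x=100, supp_y=600, n=1000))  # 1.6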

back to the top top

Absolute Difference of Conviction to 1 (option -ee)

Analogous to Absolute Difference of Lift Value to 1, only with conviction instead of lift.

back to the top top

Difference of Conviction Quotient to 1 (option -er)

Analogous to Difference of Lift Quotient to 1, only with conviction instead of lift.

back to the top top

Certainty Factor (option -ez)

The certainty factor of a rule states by how much the prior confidence changes to the posterior confidence relative to the maximally possible change in the same direction. That is, the certainty factor distinguishes whether the posterior confidence is larger or smaller than the prior confidence. If it is larger, it relates the change to a hypothetical change to the maximally possible posterior confidence 100%. That is, in this case the certainty factor is

cf(R) = (conf(X → Y) – conf(∅ → Y)) / (100% – conf(∅ → Y))

If, on the other hand, the posterior confidence is smaller than the prior confidence, the change is related to a hypothetical change to the minimally possible posterior confidence 0%. That is, in this case the certainty factor is

cf(R) = (conf(∅ → Y) – conf(X → Y)) / (conf(∅ → Y) – 0%)

In this way the certainty factor assesses a small increase of an already high prior confidence as more significant than the same increase of a smaller prior confidence. For example, an increase from 90% to 92% gives rise to a certainty factor of 0.2 = 20% (2% increase divided by the 10% maximally possible change from 90% to 100%). The same increase from 60% to 62%, however, only gives rise to a certainty factor of 0.05 = 5% (2% increase divided by the 40% maximally possible change from 60% to 100%).
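
A minimal sketch of the certainty factor (illustrative Python, not the program's code), working directly on the prior and posterior confidence:

    # Certainty factor: confidence change relative to the maximally
    # possible change in the same direction.
    def certainty_factor(posterior, prior):
        if posterior >= prior:
            return (posterior - prior) / (1.0 - prior)
        return (prior - posterior) / prior

    print(certainty_factor(0.92, 0.90))   # 0.2  (2% of a possible 10% increase)
    print(certainty_factor(0.62, 0.60))   # 0.05 (2% of a possible 40% increase)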

back to the top top

Normalized χ2-Measure (option -en)

The χ2-measure is well known from statistics. It is often used to measure the difference between a conjectured independent distribution of two discrete variables and the actual joint distribution in order to determine how strongly two variables depend on each other (or whether they are independent). The two (binary) variables considered here are the consequent and the antecedent of the rule, namely whether (all) the items contained in them are present or not (variable X: all antecedent items are present (X=1) versus at least one absent (X=0), and variable Y: consequent item present (Y=1) versus consequent item absent (Y=0)).

Formally, the χ2-measure can be defined as

χ2(R) = n (supp(X) supp(Y) – n supp(X ∪ Y))² / ( supp(X) (n – supp(X)) supp(Y) (n – supp(Y)) )

where the supports are absolute supports, that is, numbers of transactions, and n = |T| is the total number of transactions in the database.

Clearly, the χ2-measure contains the number n of cases it is computed from as a factor. Even though this is, of course, highly relevant in statistics, this is not very convenient if one wants to filter association rules that can have different support. Hence in my FP-growth program this factor is removed by simply dividing the measure by the total number n of transactions. With this normalization, the χ2-measure can have values between 0 (no dependence) and 1 (very strong – or actually perfect – dependence). The value that can be passed with the -d option is a lower bound for the strength of the dependence of the consequent on the antecedent in percent (0 – no dependence, 100 – perfect dependence). Only those association rules are selected, in which the consequent depends on the antecedent with this or a higher degree of dependence.
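
A minimal sketch of the normalized χ2-measure (illustrative Python, not the program's code), computed from absolute supports; setting yates=True applies the continuity correction described in a later section:

    # Normalized chi^2 measure (chi^2 divided by n) from absolute supports.
    def norm_chi2(supp_xy, supp_x, supp_y, n, yates=False):
        d = abs(supp_x * supp_y - supp_xy * n)
        if yates:                       # Yates' correction for continuity
            d = max(d - 0.5 * n, 0.0)
        return d * d / (supp_x * (n - supp_x) * supp_y * (n - supp_y))

    # 0 under independence, 1 under perfect dependence; -d50 would require
    # a dependence strength of at least 0.5 (i.e. 50%).
    print(norm_chi2(supp_xy=40, supp_x=50, supp_y=60, n=100))  # ~0.167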

back to the top top

p-Value Computed from χ2 Measure (option -ep)

Provided that the two considered variables are independent, the χ2-measure has a χ2-distribution, which, for two binary variables, has one degree of freedom. Hence it is possible to compute the probability that the χ2-value of an association rule under consideration can be observed by chance under the assumption that the antecedent and the consequent of the rule are independent.

Since with this measure small p-values indicate interesting rules, the value given as a parameter to the option -d is an upper limit. Note also that the option -d interprets its parameter as a percentage, so that -d1 means that only association rules with a p-value smaller than 1% = 0.01 are reported.
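
A minimal sketch of the p-value computation (illustrative Python, not the program's code); for one degree of freedom the survival function of the χ2-distribution equals erfc(sqrt(x/2)), so the standard library suffices:

    from math import erfc, sqrt

    # p-value of a chi^2 value for one degree of freedom.
    def chi2_pvalue(norm_chi2_value, n):
        x = norm_chi2_value * n        # undo the normalization by n
        return erfc(sqrt(x / 2.0))

    # -d1 keeps only rules whose p-value is below 1% = 0.01.
    print(chi2_pvalue(0.167, 100))     # roughly 4e-05, well below 0.01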

back to the top top

Normalized χ2-Measure with Yates' Correction (option -ey)

The χ2-distribution that is used to compute a p-value from the χ2-measure is a continuous distribution, but the χ2-measure is computed from the discrete entries of a 2x2 contingency table. As a consequence, the assumption that the distribution of the (discrete) χ2-measure can be approximated by the (continuous) χ2-distribution is not quite correct and can lead to some approximation error. To reduce this error, Yates suggested a correction of the χ2-measure, which is also known as Yates' correction for continuity. Its intention is to prevent an overestimation of significance (that is, p-values smaller than justified) if the counters in the contingency table have small values (that is, a small underlying data set).

Formally, the Yates-corrected χ2-measure can be defined as

χ2(R) = n (|supp(X) supp(Y) – n supp(X ∪ Y)| – 0.5 n)² / ( supp(X) (n – supp(X)) supp(Y) (n – supp(Y)) )

where the supports are absolute supports, that is, numbers of transactions, and n = |T| is the total number of transactions in the database.

Like the standard χ2-measure, the Yates-corrected χ2-measure contains the number n of cases it is computed from as a factor. Even though this is, of course, highly relevant in statistics, this is not very convenient if one wants to filter association rules that can have different support. Hence in my FP-growth program this factor is removed by simply dividing the measure by the total number n of transactions. With this normalization, the Yates-corrected χ2-measure can have values between 0 (no dependence) and 1 (very strong – or actually perfect – dependence). The value that can be passed with the -d option is a lower bound for the strength of the dependence of the consequent on the antecedent in percent (0 – no dependence, 100 – perfect dependence). Only those association rules are selected, in which the consequent depends on the antecedent with this or a higher degree of dependence.

back to the top top

p-Value Computed from Yates-corrected χ2-Measure (option -et)

Analogous to the p-value computed from the standard χ2-measure, but with the Yates-corrected χ2-measure.

back to the top top

Information Difference to Prior (option -ei)

The information difference to the prior is simply the information gain criterion (also called mutual information) that is also used, for example, in decision tree learners like C4.5 to select the split attributes. Its basic idea is as follows: Without any further information about other items in the set, we have a certain probability (or, to be exact, a relative frequency) distribution for, say, "bread" and "no bread". Let us assume it is 60% : 40% (the prior confidence of the item "bread", just as above). This distribution has a certain entropy

H = - P(bread) log2 P(bread) - P(no bread) log2 P(no bread),

where P(bread) is equivalent to the (relative) support of "bread", which in turn is equivalent to the prior confidence of "bread". The entropy of a probability distribution is, intuitively, a lower bound on the number of yes-no-questions that have to be asked in order to determine the actual value. This interpretation is less intuitive with only two possible values, but it can be made to work for this case as well. I skip the details here, as they are not that important for this application of information gain.

After we get the information that the items in the antecedent of the association rule are present (say, cheese), we have a different probability distribution, say 35% : 65%. I.e., P(bread|cheese) = 0.35 and P(no bread|cheese) = 0.65. If we also know the support of the item "cheese" (let it be P(cheese) = 0.4 and P(no cheese) = 0.6), then we can also compute the probabilities P(bread|no cheese) = 0.77 and P(no bread|no cheese) = 0.23. Hence we have two posterior probability distributions: one in case cheese is also present, and one in case cheese is not present. The question now is: How much information do we receive by observing whether the antecedent of the rule holds or not? Information is measured as a reduction of entropy. Hence the entropies of the two conditional probability distributions (one for "cheese" and one for "no cheese") are computed and summed weighted with the probability of their occurrence (that is, the relative frequency of "cheese" and "no cheese", respectively). This gives the (expected value of the) posterior or conditional entropy. The difference of this value to the prior entropy (see above) is the gain in information from the antecedent of the rule or, as I called it here, the information difference to the prior.

The value that can be given via the -d option is a lower bound for the information gain, measured in hundredths of a bit (percent of one bit). (Note that, since all items can only be present or absent, the information gain can be at most one bit.)
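
A minimal sketch of the information gain computation (illustrative Python, not the program's code), using the bread/cheese example from above:

    from math import log2

    def entropy(p):                     # binary entropy in bits
        return sum(-q * log2(q) for q in (p, 1.0 - p) if 0.0 < q < 1.0)

    # Information difference to the prior, from absolute supports.
    def info_gain(supp_xy, supp_x, supp_y, n):
        px = supp_x / n                          # P(all antecedent items present)
        p1 = supp_xy / supp_x                    # P(head | antecedent present)
        p0 = (supp_y - supp_xy) / (n - supp_x)   # P(head | antecedent absent)
        return entropy(supp_y / n) - (px * entropy(p1) + (1.0 - px) * entropy(p0))

    # P(bread) = 0.6, P(cheese) = 0.4, P(bread|cheese) = 0.35 as in the text:
    print(info_gain(supp_xy=14, supp_x=40, supp_y=60, n=100))  # about 0.13 bits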

back to the top top

p-Value Computed from G-Statistic (option -eg)

The so-called G-statistic is a less well-known alternative to the χ2-measure that is based on information gain (or mutual information). Formally, it is proportional to the product of the information gain and the number of cases the information gain is computed from (here: the (absolute) support of the antecedent of the association rule). Under independence, the G-statistic also has a χ2-distribution and thus it can be used in a similar fashion as the χ2-measure to compute a p-value. This measure (p-value computed from G-statistic) can be selected with the option -eg.

Since with this measure small p-values indicate interesting rules, the value given as a parameter to the option -d is an upper limit. Note also that the option -d interprets its parameter as a percentage, so that -d1 means that only association rules with a p-value smaller than 1% = 0.01 are reported.

back to the top top

Fisher's Exact Test; Table Probability (option -ef)

Like the p-value computed from the χ2-measure, Fisher's exact test computes a p-value from a contingency table. It is often used to measure the difference between a conjectured independent distribution of two discrete variables and the actual joint distribution in order to determine how strongly two variables depend on each other (or whether they are independent). The two (binary) variables considered here are the consequent and the antecedent of the rule, namely whether (all) the items contained in them are present or not (variable X: all antecedent items are present (X=1) versus at least one absent (X=0), and variable Y: consequent item present (Y=1) versus consequent item absent (Y=0)).

In Fisher's exact test, the marginals of the contingency table are fixed, that is, the support of the antecedent of the rule (that is, supp(X)) and the support of the consequent of the rule (that is, supp(Y)) as well as the total number of transactions, are seen as fixed. Under this assumption, the distribution of the numbers in the cells of the contingency table that respect the marginals follows a hypergeometric distribution.

Fisher's exact test is based on the probability of finding, under the assumption of independence, a contingency table that is at least as extreme as the one that was actually observed. Of course, the notion "at least as extreme" needs clarification. Unfortunately, there are several ways of making this notion mathematically precise, which give rise to different forms of Fisher's exact test. The most common is to order the possible contingency tables based on the probability that is assigned to them by the hypergeometric distribution. In this case the p-value is the sum of the probabilities (as computed from the hypergeometric distribution) of all possible contingency tables that are no more probable than the one actually observed.

Since with this measure small p-values indicate interesting rules, the value given as a parameter to the option -d is an upper limit. Note also that the option -d interprets its parameter as a percentage, so that -d1 means that only association rules with a p-value smaller than 1% = 0.01 are reported.
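
A minimal sketch of this variant of Fisher's exact test (illustrative Python, not the program's code), which enumerates all contingency tables with the given marginals:

    from math import comb

    # Fisher's exact test, table probability variant, from absolute supports.
    def fisher_pvalue(supp_xy, supp_x, supp_y, n):
        def hyper(k):   # hypergeometric probability of a table with supp(X u Y) = k
            return comb(supp_y, k) * comb(n - supp_y, supp_x - k) / comb(n, supp_x)
        p_obs = hyper(supp_xy)
        lo, hi = max(0, supp_x + supp_y - n), min(supp_x, supp_y)
        # sum the probabilities of all tables that are no more probable
        # than the table that was actually observed
        return sum(p for p in map(hyper, range(lo, hi + 1)) if p <= p_obs)

    print(fisher_pvalue(supp_xy=40, supp_x=50, supp_y=60, n=100))  # small p-value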

back to the top top

Fisher's Exact Test; χ2-Measure (option -eh)

Analogous to Fisher's exact test (table probability), but with the contingency tables ordered by the χ2-measure. In this case the p-value is the sum of the probabilities of all possible contingency tables that have a value of the χ2-measure that is no smaller than the value of the χ2-measure of the table that was actually observed.

back to the top top

Fisher's Exact Test; Information Gain (option -em)

Analogous to Fisher's exact test (table probability), but with the contingency tables ordered by the information gain. In this case the p-value is the sum of the probabilities of all possible contingency tables that have an information gain that is no larger than the information gain of the table that was actually observed.

back to the top top

Fisher's Exact Test; Support (option -es)

Analogous to Fisher's exact test (table probability), but with the contingency tables ordered by the rule support in its original definition, that is, by the support of all items in the antecedent (body) and the consequent (head) of the rule. In this case the p-value is the sum of the probabilities of all possible contingency tables that have a support for all the items in the antecedent and consequent that is no larger than the support in the table that was actually observed.

back to the top top

Original Rule Support (body & head; option -eo)

The original rule support can be chosen as an additional evaluation measure to make it possible to use both the support of the antecedent (my interpretation of rule support) and of all items occurring in the rule (original interpretation of rule support) to select rules. Note that this measure deviates from the general scheme of comparing the prior and the posterior confidence and thus serves different purposes.

back to the top top

Rule Confidence (option -ec)

The standard rule confidence can also be chosen as an additional rule evaluation measure. Obviously, this is not really useful for filtering association rules (because the rule confidence is always used by default anyway). Indeed, this option is not relevant for association rule selection, but it can be useful for evaluating item sets, see the section on extended item set selection below.

back to the top top

Selection Behavior of Some Measures

In the directory fpgrowth/doc you can find a GnuPlot script named arem.gp (arem stands for "additional rule evaluation measures"), which visualizes the behavior of some of the additional rule evaluation measures. This script draws eight 3d graphs: one for the absolute confidence difference, one for the difference of the lift quotient to one, three for the information difference to the prior confidence and three for the normalized χ2-measure. All graphs show the value of the additional rule evaluation measure over a plane defined by the prior and the posterior confidence of a rule. The latter two measures need three graphs, since they also depend on the antecedent support of a rule as a third parameter. Setting a minimal value for an additional rule evaluation measure is like flooding the corresponding graph landscape up to a certain level (given as a percentage, since all considered measures assume values between 0 and 1). Only those association rules are reported that "sit on dry land".

The first graph shows the behavior of the absolute confidence difference. For the diagonal, that is, the line where the prior and the posterior confidence are identical, its value is zero (as expected). The more the two confidences differ, the higher the value of this measure gets, but in a linear way.

The second graph shows the behavior of the difference of the lift quotient to one. Again its value is zero on the diagonal (as is the value of all these measures) and becomes greater the more the prior and the posterior confidence differ. However, it is much steeper for a small prior confidence than for a large one and it is non-linear.

The third to fifth graphs show the information difference to the prior confidence for an antecedent support (which is identical to the rule support in my interpretation, see above) of 0.2 (20%), 0.3 (30%) and 0.4 (40%). The regions at the margins, where the measure is zero, correspond to impossible combinations of prior and posterior confidence and antecedent support. As you can see, the valley gets narrower with increasing antecedent support. That is, with the same minimal value for this measure, rules with a low antecedent support need a higher confidence difference to be selected than rules with a high antecedent support. This nicely models the statistical significance of confidence changes: if you only have a few cases to support your rule, even a large deviation from the prior confidence can be explained by random fluctuations, since only a few transactions need to be different to produce a considerable change. However, if the antecedent support is large, even a small deviation (in percentage points) has to be considered significant (non-random), since it takes a lot of changes to transactions to produce even a small change in the percentage. This dependence on the antecedent support of the rule, and the fact that the valley is not pointed at the diagonal (which means that even a low minimal value can exclude a lot of rules), are the main differences between the information gain and the normalized χ2-measure on the one hand and the absolute confidence difference and the difference of the lift quotient to one on the other.

The sixth to eighth graphs show the normalized χ2-measure for an antecedent support of 0.2 (20%), 0.3 (30%) and 0.4 (40%). The valleys are very similar to those for the information difference to the prior confidence; they only have slightly steeper flanks, especially for a low antecedent support. So in practice there is no big difference between the information difference and the normalized χ2-measure.

back to the top top

Extended Item Set Selection

Since versions 4.20 (binary logarithm of support quotient) and 4.36 (other measures) there are also extended selection possibilities for frequent item sets. (These were added due to a cooperation with Sonja Gruen, RIKEN Brain Science Institute, Tokyo, Japan.)


Binary Logarithm of Support Quotient

A simple evaluation measure for item sets is the quotient of the actual support (as computed from the given transactions) and the expected support of an item set. The expected support of an item set can be computed from the support values of the individual items by assuming that the occurrences of all items are independent. Under independence we expect an item set to occur with a relative frequency that is the product of the relative occurrence frequencies of the individual items contained in it. Since this product quickly becomes very small (it decreases basically exponentially with the size of the item set), and thus the quotient of the actual and the expected frequency can become very large, it is advisable not to use the described support quotient directly, but its binary logarithm, so that its values stay in a manageable range. Computing the logarithm also has the advantage that the measure has the value zero for an item set that occurs exactly as often as expected under an assumption of item independence. The binary logarithm of the quotient of the actual and the expected support can be selected with the option -eb.

As for the additional rule evaluation measures, a minimum value for this measure can be set with the option -d. In this case only frequent item sets for which this measure exceeds the given threshold are reported. Note, however, that the option -d generally interprets its argument as a percentage. That is, if you only want item sets reported that occur at least twice as often as expected under independence (and thus desire the binary logarithm of the quotient of the actual and the expected support to be at least 1), you have to specify -d100.
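
A minimal sketch of this measure (illustrative Python, not the program's code); item_supps holds the absolute supports of the individual items, n the number of transactions:

    from math import log2, prod

    # Binary logarithm of the quotient of actual and expected support.
    def log2_support_quotient(actual_supp, item_supps, n):
        expected = n * prod(s / n for s in item_supps)   # independence assumption
        return log2(actual_supp / expected)

    # An item set occurring twice as often as expected yields 1.0;
    # to require at least that, pass -d100 (the limit times 100).
    print(log2_support_quotient(actual_supp=48, item_supps=[40, 60], n=100))  # 1.0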

back to the top top

Additional Rule Evaluation Measures

Apart from the measure described in the preceding section (which is specific to item sets), all evaluation measures for association rules, as they were described above (including rule confidence, option -ec), can be used to filter item sets. The idea is to form all possible rules with one item in the consequent and all other items of the given item set in the antecedent. Each of these rules is then evaluated with the chosen measure and the results are aggregated. The aggregated value is the evaluation of the item set, and item sets can now be filtered by requiring a minimum value for this aggregation with the option -d.

There are four different aggregation modes for the evaluations of the rules that can be formed from an item set, which can be selected via the option -a:

-ax no aggregation (use first value)
-am minimum of individual measure values
-an maximum of individual measure values
-aa average of individual measure values
-as split into equal size subsets

Here "no aggregation (use first value)" means that only one association rule is formed. This rule has that item in the consequent that comes last in the order of the items. The evaluation of this rule is the evaluation of the item set. For the next three other options all possible rules are formed. For the last option a single rule with half of the items in the antecendent and the other half in the consequent is formed. If the total number of items is not even, the antecedent contains one item more.

The order of the items is determined with the option -q. By default the items are sorted ascendingly w.r.t. the sum of the sizes of the transactions they are contained in. (Note that this choice is based on efficiency issues: it usually leads to the fastest search. However, for use with this aggregation mode, you may want to choose a different order.) Other sorting options are ascendingly w.r.t. the item frequency or descendingly w.r.t. item frequency or the sum of the sizes of the transactions the items are contained in. With the option -q0 the order in which the items are first encountered when the transactions are read is used. A specific, user-defined order may also be chosen, namely by using the option -q0 and using the optional item selection/appearances file (see this section), since the item appearances file is read before the transactions file. Therefore, in this case, the order of items in the selection/appearances file determines the order of the items.

back to the top top

Pruning with Additional Measures

By default only the minimum support is used to prune the search for frequent item sets. That is, if it is discovered that an item set does not meet the user-specified minimum support, no supersets of this item set are considered. (Note that this is a safe pruning rule, because no superset of an infrequent item set can be frequent.) With the option -p the additional item set evaluation measure can also be used for pruning the search additionally.

Pruning with an additional item set evaluation measure comes in two flavors: forward and backward. Backward pruning (which is switched on with the option -p-1) means that all frequent item sets (item sets that reach or exceed the user-specified minimum support) are evaluated with the additional measure, and those item sets whose evaluation is less than the user-specified limit (option -d) are discarded.

Note that providing or not providing the option -p-1 makes no difference for finding frequent item sets, since for this target type simply selecting an additional evaluation measure and a limit for its value already have the effect of discarding item sets for which the additional evaluation is too low. However, backward pruning can make a huge difference for finding closed or maximal item sets. The reason is that the option -p changes the order in which item sets are filtered by the additional evaluation measure and are filtered for closed (or maximal) item sets: by default (or if explicitly requested with the option -p0), the item sets are first filtered for closed (or maximal) item sets. Then only the closed (or maximal) item sets are evaluated with the additional measure and thus further reduced to those closed (or maximal) item sets that reach or exceed the user-specified evaluation limit. With the option -p-1 (or any other negative number as the argument - all negative numbers have the same effect) all frequent item sets are first evaluated with the additional measure and those that do not reach or exceed the user-specified limit are discarded. Only afterwards the closed (or maximal) item sets of this reduced set of item sets are determined.

Forward pruning, on the other hand, means that in addition to a check whether the additional evaluation of an item set reaches or exceeds the user-specified limit, it is tested whether all of its subsets (with at least a certain size) reach or exceed the limit. Only if this is the case, the item set is reported. (Technically, this is achieved by not considering any superset of item sets that failed to reach or exceed the user-specified limit for the additional measure - hence the name "forward pruning".)

Since some additional measures (like, for example, the χ2-measure) cannot reasonably be used to evaluate the empty set or single element sets, but also because this restriction can be useful for certain applications, the option -p allows for a parameter that states the minimum size of the subsets that are to be considered. For example, -p2 means that only subsets with at least two elements are required to reach or exceed the user-specified limit; single element sets and the empty set need not do so. Similarly, -p4 means that all subsets up to a size of 3 are ignored in the check; only subsets with at least 4 elements must reach or exceed the limit. Note that the option -p1 usually produces no output, because all single element sets and the empty set are formally assigned an additional evaluation of 0 for most measures. Hence 2 is usually the smallest useful parameter for this option.

Note also that, for example, -p4 does not mean that all reported item sets must have at least four elements. It rather means that there is no additional requirement for item sets up to four elements. If you want to require a minimum size for reported item sets, use the option -m.

back to the top top

Difference of Support Quotient to 1

As with the binary logarithm of the support quotient (see above), the quotient of the actual and the expected support of an item set is computed and compared to 1 (a value of 1 signifies independence of the items). A minimum value for this measure can be set with the option -d. In this case only frequent item sets for which this measure exceeds the given threshold are kept.
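
As a minimal sketch (illustrative Python, not the program's code, and assuming that the comparison to 1 is simply the absolute difference of the quotient to 1):

    from math import prod

    # Difference of the support quotient to 1 (assumed here to be the
    # absolute difference; 0 signifies independence of the items).
    def support_quotient_diff(actual_supp, item_supps, n):
        expected = n * prod(s / n for s in item_supps)
        return abs(actual_supp / expected - 1.0)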

back to the top top

License

(MIT license, or more precisely Expat License; to be found in the file mit-license.txt in the directory fpgrowth/doc in the source package of the program, see also opensource.org and wikipedia.org)

© 2004-2017 Christian Borgelt

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

back to the top top

Download

Download page with most recent version.

back to the top top

Contact

E-mail: christian@borgelt.net
Website: www.borgelt.net
back to the top top

References

An overview of frequent item set mining covering FP-growth and many other algorithms can be found in this survey paper:

Some implementation aspects are covered in this paper:

The use of pattern spectra to evaluate the statistical significance of found frequent item sets is explained in these papers:

Other references for the FP-growth algorithm include:

back to the top top

© 2004-2017 Christian Borgelt