Carpenter

Find Frequent Item Sets with the Carpenter Algorithm

Note: This documentation refers to Carpenter version 3.8 and may not be compatible with other versions.
Call carpenter without any options or arguments to see the options that are actually supported.


Introduction

Carpenter is a program for frequent item set mining, a data mining method that was originally developed for market basket analysis. Frequent item set mining aims at finding regularities in the shopping behavior of the customers of supermarkets, mail-order companies and online shops. In particular, it tries to identify sets of products that are frequently bought together. Once identified, such sets of associated products may be used to optimize the organization of the offered products on the shelves of a supermarket or the pages of a mail-order catalog or web shop, may give hints as to which products may conveniently be bundled, or may make it possible to suggest other products to customers. However, frequent item set mining can be used for a much wider variety of tasks, which all share that one is interested in finding regularities between (nominal) variables in a given data set. For an overview of frequent item set mining in general and several specific algorithms (including Carpenter), see the survey [Borgelt 2012].

This page describes the Carpenter implementation that I have been developing and improving since 2011. This implementation covers the basic scheme as developed in [Pan et al. 2003], which enumerates transaction sets, in contrast to many other frequent item set mining algorithms, which enumerate item sets. Such an approach can be highly competitive in special cases, namely if there are few transactions and (very) many items, which is a common situation in biological data sets like gene expression data. For other data sets (fewer items, many transactions), however, it is not a recommended approach.

By default the program finds closed item sets. It can also find maximal item sets, but the filtering of the closed item sets may not be very efficient.

This implementation offers a variant based on transaction identifier lists according to the description in [Pan et al. 2003], although with several optimizations due to which it significantly outperforms the implementation of the Gemini package, which is provided by the authors of [Pan et al. 2003].

The alternative is a variant based on an item occurrence counter table, which bears some vague resemblance to the horizontal approach in the RERII algorithm [Cong et al. 2004]. Which of the two variants is faster depends on the data set. By default, the variant to be used is chosen automatically based on the table size (essentially: if the data table fits into the processor cache, use the table, otherwise use the transaction identifier list version).

Note that the current version of this program can only find frequent item sets, not association rules.

Enjoy,
Christian Borgelt


Basic Notions

This section briefly introduces some basic notions needed to talk about frequent item sets. These notions are item, transaction, support and frequent, closed and maximal item set. For a more extensive coverage, consult the survey [Borgelt 2012].


Items and Transactions

On an abstract level, the input to frequent item set mining consists of a bag or multiset of transactions that are defined over a set of items, sometimes called the item base. These items may be products, special equipment items, service options etc. For an abstract treatment it suffices that items have an identity, that is, it is possible to distinguish one item from another. The item base is the set of all considered items, for example, the set of all products that are sold by a given supermarket, mail-order company or online shop. Any subset of the item base is called an item set.

A transaction is simply an item set and it represents, for example, the set of products bought by a customer. Since two or more customers may, in principle, buy the exact same set of products, we cannot model the whole of all "shopping baskets" or "shopping carts" (bought, say, in a given week) as a set of transactions, since in a set each element is unique. There are several solutions to this problem: one may, for example, model the whole of all transactions as a bag or multiset (a generalization of a set, which allows for multiple occurrences of the same element) or as a vector (where elements at different positions may be the same, but are still distinguished by their position), or by extending each transaction with a unique transaction identifier (or tid for short; note that the position of a transaction in a vector representation is an implicit transaction identifier). Still another possibility consists in using a standard set of (unique) transactions and assigning to each of them an occurrence counter. Here I use the bag/multiset terminology, even though an occasional "set of transactions" may have slipped through. (This should always be read as "bag/multiset of transactions".)
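
To make these representations concrete, here is a minimal Python sketch (purely illustrative, not part of the Carpenter program) that stores the same transactions once as a list, where positions serve as implicit transaction identifiers, and once as unique transactions with occurrence counters:

from collections import Counter

# Transactions as a list: the position of a transaction acts as an implicit
# transaction identifier, so equal "shopping baskets" can occur several times.
transactions = [frozenset("abc"), frozenset("ade"), frozenset("abc")]

# The same bag/multiset as a set of unique transactions with occurrence counters.
counted = Counter(transactions)
print(counted[frozenset("abc")])   # 2: this transaction occurs twice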

Note that the item base (the set of all considered items) is often not given explicitly, but only implicitly as the union of all given transactions. This is also the case for my Carpenter program, which by default only takes a transaction file as input. However, it is also possible to specify the item base explicitly with an optional item selection file (option -R; see the section Format of the Item Selection File below). This can be useful, for example, if one wants to restrict the analysis to a subset of all items.


Support of an Item Set

Let S be an item set and T the bag/multiset of all transactions under consideration. Then the absolute support (or simply the support) of the item set S is the number of transactions in T that contain S. Likewise, the relative support of S is the fraction (or percentage) of the transactions in T which contain S.

More formally, let S be an item set and U = { X ∈ T | S ⊆ X } the bag/multiset of all transactions in T that have S as a subset (i.e. contain all of the items in S and possibly some others). Then

supp_abs(S) = |U| = |{ X ∈ T | S ⊆ X }|

is the absolute support of S and

supp_rel(S) = (|U| / |T|) · 100%

is the relative support of S. Here |U| and |T| are the number of elements in U and T, respectively.

An important insight is that support (regardless of whether it is stated in an absolute or relative manner) is anti-monotone, that is

∀ I, J: (J ⊆ I) ⇒ (supp(J) ≥ supp(I)).

That is, if an item set is enlarged by adding one or more items, its support cannot increase.

In a supermarket setting, the item set S may be a set like S = { bread, wine, cheese } and T may be the bag/multiset of all "baskets" or "carts" of products bought by the customers of a supermarket – in a given week if you like. U is the bag/multiset of all transactions in T that contain all items in S (and maybe also some other items). For example, if a customer buys the set X = { milk, bread, apples, wine, sausages, cheese, onions, potatoes }, then S is obviously a subset of X, hence X is in U. If there are 318 customers, each giving rise to one transaction, and 242 of these customers bought such a set X or a similar one that contains S, while the other customers bought sets of products that lacked at least one of the items in S, then supp_abs(S) = 242 and supp_rel(S) = 242/318 ≈ 76.1%.
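
The following minimal Python sketch (an illustration of the definition, not the Carpenter implementation) computes the absolute and relative support of an item set; the transactions are the ten example transactions of the file test1.tab used later in the section on the input format:

def support(itemset, transactions):
    # absolute support: number of transactions that contain the item set
    itemset = frozenset(itemset)
    absolute = sum(1 for t in transactions if itemset <= t)      # |U|
    return absolute, absolute / len(transactions)                # |U|, |U|/|T|

transactions = [frozenset(t.split()) for t in
                ["a b c", "a d e", "b c d", "a b c d", "b c",
                 "a b d", "d e", "a b c d", "c d e", "a b c"]]
print(support({"b", "c"}, transactions))   # (6, 0.6): contained in 6 of 10 transactions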

The goal of frequent item set mining is to find all item sets (that is, all subsets of the item base) that occur in the given bag/multiset of transactions with at least a user-specified minimum support supp_min. Such item sets are called frequent item sets.

Since support is anti-monotone, it follows immediately that

∀ I, J: (supp(I) < supp_min ∧ J ⊇ I) ⇒ (supp(J) < supp_min).

This relationship is called the apriori property, which makes it possible to prune the search for frequent item sets effectively: no superset of an infrequent item set can be frequent, and therefore such supersets need not be visited in the search.

The default value for the minimum support in my Carpenter program is 10% (the percentage indicates implicitly that it refers to relative support). This value can be changed with the option -s. Note that the argument to this option is interpreted as a percentage if it is positive, but if it is negative, it is interpreted as an absolute number (number of transactions) rather than a percentage. That is, -s20 means a minimum relative support of 20%, while -s-20 means a minimum absolute support of 20 transactions.
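
As a minimal sketch of this interpretation (the exact rounding of a relative threshold to a number of transactions is an assumption, not taken from the program), the parameter of -s can be converted to an absolute support threshold like this:

import math

def min_support_abs(s, n_transactions):
    # s > 0: relative support in percent; s < 0: absolute number of transactions
    if s < 0:
        return -s
    return math.ceil(s / 100.0 * n_transactions)   # rounding up is an assumption

print(min_support_abs(20, 318))    # 64 transactions for -s20
print(min_support_abs(-20, 318))   # 20 transactions for -s-20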


Target Types

The target type, which can be selected via the option -t, is either closed item sets (default, option -tc) or maximal item sets (option -tm). Note that, unlike most other frequent item set mining programs, the current version of my Carpenter implementation cannot output all frequent item sets.

More detailed information about the different target types of frequent item set mining can be found in the survey [Borgelt 2012].


Closed Item Sets (option -tc)

A frequent item set is called closed if no superset has the same support (or, in other words, if all supersets have a lower support). Formally, an item set I is called closed iff

∀ J ⊃ I: supp(J) < supp(I).

If the option -tc is given, the found frequent item sets are subsequently filtered and only the closed item sets are reported.


Maximal Item Sets (option -tm)

A frequent item set is called maximal if no superset is frequent, that is, has a support reaching or exceeding the minimum support. Formally, an item set I is called maximal iff

∀ J ⊃ I: supp(J) < supp_min.

If the option -tm is given, the found frequent item sets are subsequently filtered and only the maximal item sets are reported.
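
Both target types can be illustrated with a small Python sketch that filters a dictionary mapping frequent item sets to their (absolute) supports; it merely mirrors the two definitions above and is not how Carpenter filters item sets internally:

def closed_sets(supp):
    # closed: no proper superset has the same support
    return {s for s in supp
            if not any(s < t and supp[t] == supp[s] for t in supp)}

def maximal_sets(supp):
    # maximal: no proper superset is frequent (i.e. present in supp at all)
    return {s for s in supp if not any(s < t for t in supp)}

supp = {frozenset("a"): 6, frozenset("b"): 7, frozenset("ab"): 5}
print(closed_sets(supp))    # all three sets are closed
print(maximal_sets(supp))   # only frozenset({'a', 'b'}) is maximal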


Program Invocation

My Carpenter implementation is a command line program that has to be called in a terminal or command window or from a shell script. If double-clicked in a file manager (e.g. in Microsoft Windows), it merely prints a usage message to a terminal/command window that is briefly opened and immediately closed again. This does not mean that the program does not work. In order to run the program, open a terminal/command window, change directory to where you stored the program, and then type an invocation command.

The general syntax of the program invocation is

carpenter [options] infile [outfile]

The first argument infile, which is mandatory, is the name of a file that contains the transactions to analyze. The format of this file is described in the section Input Format below. If instead of a file name a single minus sign "-" or an empty string "" is given, the input is read from standard input rather than from a file.

The second argument outfile, which is optional (as indicated by the brackets), is the name of a file to which the found frequent item sets are to be written. That it is optional is useful for benchmark tests, where the time it takes to write the output to a file can conceal the actual search time, or if only a pattern spectrum (number of found frequent item sets collected by size and (absolute) support; option -P) is to be generated. The format in which frequent item sets are written to the output file is described in the section Output Format below. If instead of a file name a single minus sign "-" or an empty string "" is given, the output is written to standard output rather than to a file.

In addition to the input and output file several options can be given, all of which consist of a minus sign and a single letter. The full list of options can be found in the next section.

Some options take a parameter. For example, the option -s, with which the minimum support is specified, takes a number as a parameter. If the parameter is a number or a single letter, it must follow the option letter directly, without a separating space (e.g. -s20). A space between the option letter and its parameter is only allowed if the parameter is a string, for example, a file name; even in this case, however, the parameter may follow the option letter directly.

Options may be combined. For example,

carpenter -s10m2n5 input output

means that the minimum support is 10% (option -s), the minimum number of items in an item set is 2 (option -m) and the maximum number of items in an item set is 5 (option -n).

Options may be specified anywhere on the command line, that is, before the input file name, in between the input and output file names, or after the output file name.

If an option is given more than once, the last occurrence counts. That is, with

carpenter -s10 -s20 input output

the minimum support is 20%, as the -s10 is overwritten by the following -s20.


Program Options

My Carpenter implementation supports the following options
(a # after the option letter means that the option takes a parameter):

option   meaning                                                    default
-t#      target type                                                c
         c: closed item sets
         m: maximal item sets
-m#      minimum number of items per item set                       1
-n#      maximum number of items per item set                       no limit
-s#      minimum support of an item set                             10
         positive: percentage of transactions
         negative: absolute number of transactions
-e#      additional evaluation measure                              x
         x: no measure
         b: binary logarithm of support quotient
-d#      minimum value of additional evaluation measure             10
         (as a percentage)
-q#      sort items w.r.t. their frequency                          2
         0: do not sort
         1: ascending, -1: descending w.r.t. item frequency
         2: ascending, -2: descending w.r.t. transaction size sum
-p       do not collate equal transactions                          collate
-A#      variant of the Carpenter algorithm to use                  a
         a: automatic choice based on table size
         t: item occurrence counter table
         l: transaction identifier lists
-x       do not prune with perfect extensions                       prune
-j       filter maximal item sets with repository                   extra
         (needs less memory, but is usually slower)
-y       add only maximal item sets to repository                   all closed
         (needs less memory, but is usually slower)
-F#:#..  support border for filtering item sets                     none
         (list of minimum support values, one per item set size,
         starting at the minimum size, as given with option -m#)
-R#      read an item selection from a file
         (parameter: file name)
-P#      write a pattern spectrum to a file
         (parameter: file name)
-Z       print item set statistics
         (number of item sets per size)
-N       do not pre-format some integer numbers
-g       write output in scanable form
         (quote certain characters)
-h#      record header for output                                   ""
-k#      item separator for output                                  " "
-v#      output format for item set information                     "  (%S)"
         (changed to "  (%a)" if the parameter of -s is negative)
-w       transaction weight in last field                           only items
-r#      record/transaction separators                              "\n"
-f#      field/item separators                                      " \t,"
-b#      blank characters                                           " \t\r"
-C#      comment characters                                         "#"
-!       print additional option information

Input Format

Format of the Transactions File

The transactions file has to be a (text) file structured by field and record separators and blanks. Record separators, not surprisingly, separate records, usually lines; field separators separate fields, usually words. Blanks are used to fill fields, for example, to align them. In the transactions file each record must contain one transaction, i.e. a list of item identifiers, which are separated by field separators. An empty record is interpreted as an empty transaction. That is, it is counted for the total number of transactions, but does not count for the support of any item set other than the empty set. A transaction may contain the same item multiple times (that is, duplicate items do not raise an error), but this multiplicity is disregarded by the Carpenter program. It only considers whether an item is contained in a transaction or not, not how many times an item is contained in a transaction.

Example input files can be found in the directory carpenter/ex in the source package. These files are used in the following to demonstrate how to use the command line options -r, -f, -b and -w. In addition, there are several conversion scripts (for Linux/Unix), with which different common input formats can be converted into the format required by the Carpenter program.

In the file test1.tab transactions are separated by newline characters (that is, each line of the file contains one transaction) and the items of a transaction are separated by spaces. That is, the file test1.tab looks like this:

a b c
a d e
b c d
a b c d
b c
a b d
d e
a b c d
c d e
a b c

As can be seen, there are ten transactions over the item base {a, b, c, d, e}, which contain between two and four items each.

The file test1.tab is in the standard input format and hence it can be processed directly:

carpenter test1.tab test1.out

Instead of spaces, tabulators may be used, and it is possible to mix spaces and tabulators. Note, however, that multiple consecutive white space characters (like multiple spaces or a space and a tabulator etc.) are interpreted as a single field separator. The reason is that by default spaces, tabulators and the carriage return character are interpreted as blank characters, which are removed when reading items. Hence only the first white space character after an item is interpreted as a field separator, all following white space characters are interpreted as blanks.

Note also that commas are among the default field separators as well. That is, if the file test1.tab looked like this:

a,b,c
a,d,e
b,c,d
a,b,c,d
b,c
a,b,d
d,e
a,b,c,d
c,d,e
a,b,c

it could still be processed directly with the command stated above. You may also mix spaces, tabulators and commas.

Unfortunately, though, the fact that commas are interpreted as field separators does not necessarily mean that CSV-files (where CSV stands for "comma separated values"), as they can be written by programs like Microsoft Excel, can be processed directly. The reason is that in a CSV-file all lines contain the same number of fields. That is, in CSV-format, the above input file would look like this (file test1.csv in directory carpenter/ex):

a,b,c,
a,d,e,
b,c,d,
a,b,c,d
b,c,,
a,b,d,
d,e,,
a,b,c,d
c,d,e,
a,b,c,

Note the single and double commas at the end of most lines, which separate empty fields (as these transactions have fewer items than the longest transaction). While a single comma at the end of a line does not cause any problems and is simply ignored, two or more commas lead to an error message "item expected", because an item may not be an empty string. This can be fixed by declaring the comma a blank character (option -b). That is, the CSV-file can be processed with:

carpenter -b, test1.csv test1.out

Note, however, that the characters given with the option -b replace the default blank characters. So if you still need spaces to be interpreted as blanks, they have to be specified as well:

carpenter -b" ," test1.csv test1.out

In the file test2.tab the same transactions can be found, but several different field separators are used:

a,b,c
a,d,e
b.c.d
a,b,c,d
b:c
a,b,d
d,e
a,b,c,d
c;d;e
a,b,c

The file test2.tab can be processed by declaring different field separators with the option -f:

carpenter -f",.;:" -l test2.tab test2.out

The file test3.tab has basically the same format as the file test1.tab, with the only difference that the last field of each record states an (integer) transaction weight/multiplicity.

a b c 2
a d e 1
b c d 1
a b c d 2
b c 1
a b d 1
d e 1
c d e 1

This allows us to combine equal transactions, so that test3.tab has only 8 lines, while test1.tab has 10 lines, because the transactions a b c and a b c d each occur twice. In order to instruct the Carpenter program to interpret the last field of each record as such a weight/multiplicity, it has to be invoked with the option -w:

carpenter -w test3.tab test3.out
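
If such a weighted file needs to be processed by tools that do not support a weight field, a minimal Python sketch (purely illustrative, not part of the Carpenter distribution) can expand the weight/multiplicity in the last field back into repeated transactions:

transactions = []
with open("test3.tab") as f:
    for line in f:
        fields = line.split()
        if not fields:
            continue                      # skip empty records
        *items, weight = fields           # the last field is the weight/multiplicity
        transactions.extend([frozenset(items)] * int(weight))

print(len(transactions))                  # 10 transactions, as in test1.tab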

The files test4.tab to test6.tab are in formats that may be common, but which cannot be processed directly with the Carpenter program.

In the file test4.tab each line contains a transaction identifier and an item, separated by a space (not shown because of the large number of lines). This file can be converted (on Linux/Unix systems) into the standard input format with the script tid2set, i.e., with

tid2set test4.tab x.tab

Note that this script sorts the input file (here: test4.tab) w.r.t. the transaction identifier, so that items belonging to the same transaction occupy consecutive lines/records. That is, the input need not already be sorted by transaction identifier, rather the script does this to make sure that the conversion works.
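
An equivalent conversion can be sketched in Python as follows; this is only an illustration of the format, not the tid2set script itself, whose exact behavior (for example, the sort order) may differ:

from collections import defaultdict

groups = defaultdict(list)
with open("test4.tab") as f:
    for line in f:
        tid, item = line.split()          # one "transaction identifier, item" pair per line
        groups[tid].append(item)

with open("x.tab", "w") as out:
    for tid in sorted(groups):            # group items of the same transaction together
        out.write(" ".join(groups[tid]) + "\n")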

In the file test5.tab the first line states the item names and the following lines contain flags T (true) and F (false) depending on whether the item is contained in the transaction represented by the line or not:

a b c d e
T T T F F
T F F T T
F T T T F
T T T T F
F T T F F
T T F T F
F F F T T
T T T T F
F F T T T
T T T F F

This format can be converted (on Linux/Unix systems) into the standard input format with the script flg2set, i.e., with

flg2set test5.tab x.tab

This script interprets T, t, and 1 as "true", that is, as indicators that the item corresponding to the column is contained in the transaction corresponding to the row, and any other entry as "false" (the item is not contained in the transaction).
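
Again, an equivalent conversion can be sketched in Python (only an illustration of the format, not the flg2set script itself):

with open("test5.tab") as f, open("x.tab", "w") as out:
    items = f.readline().split()          # first line: the item names
    for line in f:
        flags = line.split()
        # T, t and 1 indicate that the item of the column is contained
        present = [i for i, flg in zip(items, flags) if flg in ("T", "t", "1")]
        out.write(" ".join(present) + "\n")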

In the file test6.tab there is one item per line and transactions are separated by blank lines (not shown here because of the large number of lines). This format can be converted (on Linux/Unix systems) into the standard input format with the script row2set, i.e., with:

row2set test6.tab x.tab

Note that the file test6.tab could be processed directly if the transactions were separated not by a mere empty line (that is, by two consecutive newline characters), but by a line containing a special character, for example %. In that case the file could be processed with

carpenter -r"%" -f"\n" -b"\n" test6.tab x.tab

The additional scripts tab2set and hdr2set convert tables with column numbers or column names into a format appropriate for the Carpenter program. They are invoked in the same way as all other scripts discussed above, i.e., with

tab2set a.tab b.tab

or

hdr2set a.tab b.tab

where a.tab is the name of the input file and b.tab the name of the output file. The script tab2set replaces each table entry "x" of the input file by "Xi=x", where i is the column number (starting with 1).

The script hdr2set reads the variable names from the first line of the input file and then replaces each table entry "x" by "X=x", where "X" is the variable name that was found in the corresponding column of the first line. These scripts are handy if you want to process tabular data by treating each table row as a transaction.
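
As an illustration (the file contents are hypothetical), suppose the input file a.tab contains the two data rows

red small
blue large

Then tab2set produces the transactions

X1=red X2=small
X1=blue X2=large

If the file additionally starts with a header line "color size", hdr2set produces instead

color=red size=small
color=blue size=large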

Note that any input may also be read from standard input and any output may be sent to standard output, simply by specifying a '-' or an empty string "" instead of a filename. For example (on a Linux/Unix system)

cat test1.tab | carpenter - -

reads the transactions from standard input (where they are fed by the cat command) and writes the found frequent item sets directly to the terminal. They may be piped to any other program or script, since all other messages of the Carpenter program are written to standard error.


Format of the Item Selection File

In addition to an input file with the transactions and an output file for the found item sets, an item selection file may be passed to the Carpenter program with the option -R. This file lists a selection of items that are to be included in the search, while any items not listed in this file are to be ignored. It can be seen as a specification of the item base for the search.

The item selection file has to be a (text) file structured by the same field and record separators and blanks as the transactions file. Items may be listed in a single record or in multiple records; it does not matter in which record an item occurs. All that matters is whether an item occurs in the item selection file (then it is included in the search) or not (then it is ignored).
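
For example, if an item selection file contains the line

a b d

then the invocation (the file name select.tab is only an example)

carpenter -R select.tab test1.tab test1.out

restricts the search to item sets over the items a, b and d; the items c and e, although they occur in the transactions of test1.tab, are ignored.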


Output Format

The output format for item sets is fairly flexible. In particular, the additional information about item sets (support, additional evaluation measure) can be formatted freely.


Format of Frequent Item Sets

Each line of the output file contains one item set in the format

A B C [...] <add. info>

where A, B and C are item identifiers, and <add. info> is determined by the additional information output format (option -v). The item separator may be changed with the option -k. For example,

carpenter -k, test1.tab test.out

produces output in the format

A,B,C,[...] <add. info>

Each output line may be preceded by a string that is specified with the option -h (record header for output).

The output format for the additional item set information can be any string with the following special format symbols (similar in style to the special format symbols of the printf function in the programming language C):

%% a percent sign
%i number of items (item set size)
%a absolute item set support
%s relative item set support as a fraction
%S relative item set support as a percentage
%e additional evaluation measure
%E additional evaluation measure as a percentage

All format specifiers can be extended by an integer number between the percent sign and the defining character, which specifies the number of decimal digits to be printed (except for those that refer to integer numbers, like the number of items and the absolute support, for which such a number of digits is ignored). If no such number is given, (up to) six significant digits are reported.

The default output format for the additional item set information is "  (%S)". That is, the relative support of the item set is printed with (up to) six significant digits, enclosed in parentheses; two spaces separate the additional information from the last item in the set. The default format is automatically changed to "  (%a)" if the minimum support (option -s) is given as a negative number (absolute support), so that the output is consistent with the support threshold.
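
For example (the format string chosen here is arbitrary), the invocation

carpenter -s-2 -v" (%a/%1S%%)" test1.tab test1.out

reports for each found item set its absolute support and its relative support as a percentage with one decimal digit; an item set contained in 6 of the 10 transactions of test1.tab would thus be followed by something like " (6/60.0%)".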


Format of a Pattern Spectrum

A pattern spectrum collects the number of found frequent item sets sub-divided by the size and the support of the item sets. Pattern spectra (together with surrogate data/randomization methods) can be useful to determine the statistical significance of found frequent item sets. Details can be found in the papers [Picado-Muino et al. 2013] and [Torre et al. 2013].

My Carpenter implementation can be instructed to collect and write a pattern spectrum with the option -P, which takes as a parameter the name of the file to which the pattern spectrum is to be written.

A pattern spectrum is written as a list of (size, support, count) triplets. The output file is a simple text file that contains one triplet per line, with the three numbers separated by spaces. For example, for the input file test1.tab (see the section Format of the Transactions File above) the pattern spectrum is (with default option settings):

1 3 1
1 6 1
1 7 3
2 1 2
2 3 1
2 4 4
2 5 1
2 6 1
3 1 2
3 2 1
3 3 2
3 4 1
4 2 1

The first column contains the different sizes of item sets, the second column the different support values, and the third column the number of item sets found for the corresponding (size, support) combination. For example, the sixth row indicates that there are four frequent item sets with two items and support 4. Note that in a pattern spectrum the support is always given as absolute support, that is, as a number of transactions.
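
Since the format is so simple, a pattern spectrum is easy to read in other programs, for example with a minimal Python sketch like the following (the file name test1.psp is only an example; it is whatever name was passed to the option -P):

spectrum = {}                             # maps (size, support) -> number of item sets
with open("test1.psp") as f:              # the file name is only an example
    for line in f:
        size, supp, count = line.split()
        spectrum[(int(size), int(supp))] = int(count)

print(spectrum[(2, 4)])                   # 4 item sets with two items and support 4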


License

(MIT license, or more precisely Expat License; to be found in the file mit-license.txt in the directory carpenter/doc in the source package of the program, see also opensource.org and wikipedia.org)

© 2002-2017 Christian Borgelt

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Download

Download page with most recent version.


Contact

E-mail: christian@borgelt.net
Website: www.borgelt.net

References

An overview of frequent item set mining covering Carpenter and many other algorithms can be found in this survey paper:

A description of some implementation aspects can be found in this paper:

The use of pattern spectra to evaluate the statistical significance of found frequent item sets is explained in these papers:

Other references for the Carpenter algorithm include:

