Note: This documentation refers to SaM version 3.22
and may not be compatible with other versions.
Call sam without any options or arguments
to check the actually supported options.
SaM (Ssplit and Merge) is a program for frequent item set mining, a data mining method that was originally developed for market basket analysis. Frequent item set mining aims at finding regularities in the shopping behavior of the customers of supermarkets, mail-order companies and online shops. In particular, it tries to identify sets of products that are frequently bought together. Once identified, such sets of associated products may be used to optimize the organization of the offered products on the shelves of a supermarket or the pages of a mail-order catalog or web shop, may give hints which products may conveniently be bundled, or may allow to suggest other products to customers. However, frequent item set mining may be used for a much wider variety of tasks, which share that one is interested in finding regularities between (nominal) variables in a given data set. For an overview of frequent item set mining in general and several specific algorithms (including SaM), see the survey [Borgelt 2012].
This page describes the SaM implementation that I have been developing and improving since 2008. SaM was originally designed for educational purposes, as it shows that a depth-first search on the subset lattice can be combined with a purely horizontal representation of the transaction database. As such it contrasts Eclat, which also carries out a depth-first search, but relies on a purely vertical representation of the transaction database. As a consequence, Eclat determines the support of item sets by intersecting lists of transaction identifiers while SaM sums transaction weights (or multiplicities) in the split step.
The basic algorithmic scheme was first described in [Borgelt and Wang 2009a], which also covers how to find approximate frequent item sets with SaM. The basic idea is that the transactions in the database are noisy observations of the actual transactions, that is, a transaction may actually contain an item even though it was not recorded in the database. In order to handle such a situation, SaM allows an item set to be supported by transactions that contain only part of the items, although these transactions have a lower weight. These weights are computed from item insertion penalties, based on what items have to be inserted into the transaction to make it contain an item set under consideration. Details of the approach as well as how the algorithm needs to be modified to accommodate this can be found in [Borgelt and Wang 2009a] and [Borgelt and Wang 2009b].
Note that the current version of this program can only find frequent item sets, not association rules.
Enjoy,
Christian Borgelt
back to the top |
This section briefly introduces some basic notions needed to talk about frequent item sets. These notions are item, transaction, support and frequent, closed and maximal item set. For a more extensive coverage, consult the survey [Borgelt 2012].
On an abstract level, the input to frequent item set mining consists of a bag or multiset of transactions that are defined over a set of items, sometimes called the item base. These items may be products, special equipment items, service options etc. For an abstract treatment it suffices that items have an identity, that is, it is possible to distinguish one item from another. The item base is the set of all considered items, for example, the set of all products that are sold by a given supermarket, mail-order company or online shop. Any subset of the item base is called an item set.
A transaction is simply an item set and it represents, for example, the set of products bought by a customer. Since two or more customers may, in principle, buy the exact same set of products, we cannot model the whole of all "shopping baskets" or "shopping carts" (bought, say, in a given week) as a set of transactions, since in a set each element is unique. There are several solutions to this problem: one may, for example, model the whole of all transactions as a bag or multiset (a generalization of a set, which allows for multiple occurrences of the same element) or as a vector (where elements at different positions may be the same, but are still distinguished by their position), or by extending each transaction with a unique transaction identifier (or tid for short; note that the position of a transaction in a vector representation is an implicit transaction identifier). Still another possibility consists in using a standard set of (unique) transactions and assigning to each of them an occurrence counter. Here I use the bag/multiset terminology, even though an occasional "set of transactions" may have slipped through. (This should always be read as "bag/multiset of transactions".)
Note that the item base (the set of all considered items) is often not given explicitly, but only implicitly as the union of all given transactions. This is also the case for my SaM program, which by default only takes a transaction file as input. However, it is also possible to specify the item base explicitly with an optional item selection file (option -R; discussed in this section). This can be useful, for example, if one wants to restrict the analysis to a subset of all items.
back to the top |
Let S be an item set and T the bag/multiset of all transactions under consideration. Then the absolute support (or simply the support) of the item set S is the number of transactions in T that contain S. Likewise, the relative support of S is the fraction (or percentage) of the transactions in T which contain S.
More formally, let S be an item set and U = { X∈T | S⊆t } the bag/multiset of all transactions in T that have S as a subset (i.e. contain all of the items in S and possibly some others). Then
suppabs(S) = |U| = |{ X∈T | S⊆t }|
is the absolute support of S and
supprel(S) = (|U| / |T|) *100%
is the relative support of S. Here |U| and |T| are the number of elements in U and T, respectively.
An important insight is that support (regardless of whether it is stated in an absolute or relative manner) is anti-monotone, that is
∀ I, J: (J ⊆ I) ⇒ (supp(J) ≥ supp(I)).
That is, if an item set is enlarged by adding one or more items, its support cannot increase.
In a supermarket setting, the item set S may be a set like S = { bread, wine, cheese } and T may be the bag/multiset of all "baskets" or "carts" of products bought by the customers of a supermarket – in a given week if you like. U is the bag/multiset of all transactions in T that contain all items in S (and maybe also some other items). For example, if a customer buys the set X = { milk, bread, apples, wine, sausages, cheese, onions, potatoes }, then S is obviously a subset of X, hence X is in U. If there are 318 customers, each giving rise to one transaction, and 242 of these customers bought such a set X or a similar one that contains S, while the other customers bought sets of products that lacked at least one of the items in S, then suppabs(S) = 242 and supprel(S) = 242/318 = 76.1%.
The goal of frequent item set mining is to find all item sets (that is, all subsets of the item base) that occur in the given bag/multiset of transactions with at least a user-specified minimum support suppmin. Such item sets are called frequent item sets.
Since support is anti-monotone, it follows immediately that
∀ I, J: (supp(I) < suppmin ∧ J ⊇ I) ⇒ (supp(J) < suppmin).
This relationship is called the apriori property, which allows to prune the search for frequent item sets effectively: no supersets of an infrequent item set can be frequent and thus need not be visited in the search.
The default value for the minimum support in my SaM program is 10% (the percentage indicates implicitly that it refers to relative support). This value can be changed with the option -s. Note that the argument to this option is interpreted as a percentage if it is positive, but if it is negative, it is interpreted as an absolute number (number of transactions) rather than a percentage. That is, -s20 means a minimum relative support of 20%, while -s-20 means a minimum absolute support of 20 transactions.
back to the top |
An annoying problem in frequent item set mining is that the number of frequent item sets is often huge and thus the output can easily exceed the size of the transaction database to mine. In order to mitigate this problem, several restrictions of the set of frequent item sets have been suggested. These restrictions are covered by the target type.
The target type, which can be selected via the option -t, is either frequent item sets (default, option -ts), closed item sets (option -tc), or maximal item sets (option -tm).
More detailed information about the different target types of frequent item set mining can be found in the survey [Borgelt 2012].
Often one only wants to find frequent item sets. That is, one wants to find all item sets with a support exceeding a certain threshold, the so-called minimum support. For my SaM program this is the default operation mode. However, this mode can also be selected explicitly with the option -ts.
A frequent item set is called closed if no superset has the same support (or, in other words, if all supersets have a lower support). Formally, an item set I is called closed iff
∀ J ⊃ I: supp(J) < supp(I).
If the option -tc is given, the found frequent item sets are subsequently filtered and only the closed item sets are reported.
A frequent item set is called maximal if no superset is frequent, that is, has a support reaching or exceeding the minimum support. Formally, an item set I is called maximal iff
∀ J ⊃ I: supp(J) < suppmin.
If the option -tm is given, the found frequent item sets are subsequently filtered and only the maximal item sets are reported.
My SaM implementation is a command line program that has to be called in a terminal or command window or from a shell script. If double-clicked in a file manager (e.g. in Microsoft Windows), it merely prints a usage message to a terminal/command window that is briefly opened and immediately closed again. This does not mean that the program does not work. In order to run the program, open a terminal/command window, change directory to where you stored the program, and then type an invocation command.
The general syntax of the program invocation is
sam [options] infile [outfile]
The first argument infile, which is mandatory, is the name of a file that contains the transactions to analyze. The format of this file is described in this section. If instead of a file name a single minus sign "-" or an empty string "" is given, the input is read from standard input rather than from a file.
The second argument outfile, which is optional (as indicated by the brackets), is the name of a file to which the found frequent item sets are to be written. That it is optional is useful for benchmark tests, where the time it takes to write the output to a file can conceal the actual search time, or if only a pattern spectrum (number of found frequent item sets collected by size and (absolute) support; option -P) is to be found. The format in which frequent item sets are written to the output file is described in this section. If instead of a file name a single minus sign "-" or an empty string "" is given, the output is written to standard output rather than to a file.
In addition to the input and output file several options can be given, all of which consist of a minus sign and a single letter. The full list of options can be found in the next section.
Some options take a parameter. For example, the option -s, with which the minimum support is specified, takes a number as a parameter, which must follow the letter s without any separating space. A space between the option character and its parameter is only allowed if the parameter is a string, for example, a file name. (However, even in this case the parameter may follow the option letter directly.) If the parameter is a number or a single letter, it must follow the option letter directly.
Options may be combined. For example,
sam -s10m2n5 input output
means that the minimum support is 10% (option -s), the minimum number of items in an item set is 2 (option -m) and the maximum number of items in an item set is 5 (option -n).
Options may be specified anywhere on the command line, that is, before the input file name, in between the input and output file names, or after the output file name.
If an option is given more than once, the last statement counts. That is, with
sam -s10 -s20 input output
the minimum support is 20%, as the -s10 is overwritten by the following -s20.
back to the top |
My SaM implementation supports the following options
(a # after the option letter means that the option
takes a parameter):
option | meaning | default |
---|---|---|
-t# | target type s: frequent item sets c: closed (frequent) item sets m: maximal (frequent) item sets |
s |
-m# | minimum number of items per item set | 1 |
-n# | maximum number of items per item set | no limit |
-s# | minimum (standard) support of an item set positive: percentage of transactions negative: absolute number of transactions |
10 |
-S# | maximum support of an item set positive: percentage of transactions negative: absolute number of transactions |
100 |
-i# | minimum support with item insertions (only with item insertions, option -u) positive: percentage of transactions negative: absolute number of transactions |
10 |
-T# | t-norm for combining item penalties m: minimum T(a,b) = min(a,b) n: nil-potent minimum T(a,b) = min(a,b) if a+b > 1 else 0 p: product T(a,b) = a*b l: Łukasiewicz t-norm T(a,b) = max(0,a+b-1) h: Hamacher product T(a,b) = 0 if a = b = 0 else a*b/(a+b-a*b) |
p |
-u# | minimum weight of a transaction (a value >= 0 selects item insertions) |
-1 |
-e# | additional evaluation measure x: no measure b: binary logarithm of support quotient |
x |
-d# | threshold for additional evaluation measure (as a percentage) |
10 |
-q# | sort items w.r.t. their frequency 0: do not sort 1: ascending, -1: descending w.r.t. item frequency 2: ascending, -2: descending w.r.t. transaction size sum |
2 |
-A# | variant of the SaM algorithm to use s: basic split and merge search b: split and merge with binary search d: split and merge with double source buffering t: split and merge with transaction prefix tree |
b |
-y | threshold for transaction source merging (for algorithm variant 'b', option -ab) |
8192 |
-x | do not prune with perfect extensions | prune |
-l# | number of items for k-items machine (only for algorithm variants i,r,o) |
16 |
-F#:#.. | support border for filtering item sets (list of minimum support values, one per item set size, starting at the minimum size, as given with option -m#) |
none |
-R# | read item selection/insertion penalties from a file parameter: file name (insertion penalities if -u is given with a parameter >= 0) |
|
-P# | write a pattern spectrum to a file parameter: file name |
|
-Z | print item set statistics (number of item sets per size) |
|
-N | do not pre-format some integer numbers | |
-g | write output in scanable form (quote certain characters) |
|
-h# | record header for output | "" |
-k# | item separator for output | " " |
-v# | output format for item set information (changed to " (%a)" if parameter of -s is negative) |
" (%S)" " (%a)" |
-w | transaction weight in last field | only items |
-r# | record/transaction separators | "\n" |
-f# | field/item separators | " \t," |
-b# | blank characters | " \t\r" |
-C# | comment characters | "#" |
-! | print additional option information |
back to the top |
The transactions file has to be a (text) file structured by field and record separators and blanks. Record separators, not surprisingly, separate records, usually lines (since the default record separator is the newline character), field separators separate fields, usually words (since among the default field separators are the space and the tabulator, but also the comma). Blanks are used to fill fields, for example, to align them. In the transactions file each record must contain one transaction, i.e. a list of item identifiers, which are separated by field separators. An empty record is interpreted as an empty transaction. That is, it is counted for the total number of transactions, but does not count for the support of any item set other than the empty set. A transaction may contain the same item multiple times (that is, duplicate items do not raise an error), but this multiplicity is disregarded by the SaM program. It only considers whether an item is contained in a transaction or not, not how many times an item is contained in a transaction.
Example input files can be found in the directory sam/ex in the source package. These files are used in the following to demonstrate how to use the command line options -r, -f, -b and -w. In addition, there are several conversion scripts (for Linux/Unix), with which different common input formats can be converted into the format required by the SaM program.
In the file test1.tab transactions are separated by newline characters (that is, each line of the file contains one transaction) and the items of a transaction are separated by spaces. That is, the file test1.tab looks like this:
a b c
a d e
b c d
a b c d
b c
a b d
d e
a b c d
c d e
a b c
As can be seen, there are ten transactions over the item base {a, b, c, d, e}, which contain between two and four items each.
The file test1.tab is in the standard input format and hence it can be processed directly:
sam test1.tab test1.out
Instead of spaces, tabulators may be used, and it is possible to mix spaces and tabulators. Note, however, that multiple consecutive white space characters (like multiple spaces or a space and a tabulator etc.) are interpreted as a single field separator. The reason is that by default spaces, tabulators and the carriage return character are interpreted as blank characters, which are removed when reading items. Hence only the first white space character after an item is interpreted as a field separator, all following white space characters are interpreted as blanks.
Note also that commas are among the default field separators as well. That is, if the file test1.tab looked like this:
a,b,c
a,d,e
b,c,d
a,b,c,d
b,c
a,b,d
d,e
a,b,c,d
c,d,e
a,b,c
it could still be processed directly with the command stated above. You may also mix spaces, tabulators and commas.
Unfortunately, though, the fact that commas are interpreted as field separators does not necessarily mean that CSV-files (where CSV stands for "comma separated values"), as they can be written by programs like Microsoft Excel, can be processed directly. The reason is that in a CSV-file all lines contain the same number of fields. That is, in CSV-format, the above input file would look like this (file test1.csv in directory sam/ex):
a,b,c,
a,d,e,
b,c,d,
a,b,c,d
b,c,,
a,b,d,
d,e,,
a,b,c,d
c,d,e,
a,b,c,
Note the single and double commas at the end of most lines, which separate empty fields (as these transactions have fewer items than the longest transaction). While a single comma at the end of a line does not cause any problems and is simply ignored, two or more commas lead to an error message "item expected", because an item may not be an empty string. This can be fixed by declaring the comma a blank character (option -b). That is, the CSV-file can be processed with:
sam -b, test1.csv test1.out
Note, however, that the characters given with the option -b replace the default blank characters. So if you still need spaces to be interpreted as blanks, they have to be specified as well:
sam -b" ," test1.csv test1.out
In the file test2.tab the same transactions can be found, but several different field separators are used:
a,b,c
a,d,e
b.c.d
a,b,c,d
b:c
a,b,d
d,e
a,b,c,d
c;d;e
a,b,c
The file test2.tab can be processed by declaring different field separators with the option -f:
sam -f",.;:" -l test2.tab test2.out
The file test3.tab has basically the same format as the file test1.tab, with the only difference that the last fields of each record states an (integer) transaction weight/multiplicity.
a b c 2
a d e 1
b c d 1
a b c d 2
b c 1
a b d 1
d e 1
c d e 1
This allows us to combine transactions, so that test2.tab has only 8 lines, while test1.tab has 10 lines, because the transactions a b c and a b c d occur twice. In order to instruct the SaM program to interpret the last field of each record as such a weight/multiplicity, is has to be invoked with the option -w:
sam -w test3.tab test3.out
The files test4.tab to test6.tab are in formats that may be common, but which cannot be processed directly with the SaM program.
In the file test4.tab each line contains a transaction identifier and an item, separated by a space (not shown because of the large number of lines). This file can be converted (on Linux/Unix systems) into the standard input format with the script tid2set, i.e., with
tid2set test4.tab x.tab
Note that this script sorts the input file (here: test4.tab) w.r.t. the transaction identifier, so that items belonging to the same transaction occupy consecutive lines/records. That is, the input need not already be sorted by transaction identifier, rather the script does this to make sure that the conversion works.
In the file test5.tab the first line states the item names and the following lines contain flags T (true) and F (false) depending on whether the item is contained in the transaction represented by the line or not:
a b c d e
T T T F F
T F F T T
F T T T F
T T T T F
F T T F F
T T F T F
F F F T T
T T T T F
F F T T T
T T T F F
This format can be converted (on Linux/Unix systems) into the standard input format with the script flg2set, i.e., with
flg2set test5.tab x.tab
This script interprets T, t, and 1 as "true", that is, as indicators that the item corresponding to the column is contained in the transaction corresponding to the row, and any other entry as "false" (the item is not contained in the transaction).
In the file test6.tab there is one item per line and transactions are separated by blank lines (not shown here because of the large number of lines). This format can be converted (on Linux/Unix systems) into the standard input format with the script row2set, i.e., with:
row2set test6.tab x.tab
Note that the file test6.tab could be processed directly if the transactions were not separated by a mere empty line (that is, two newline characters following each other), but if the empty line contained a special character, for example %. In this case the file can be processed directly with
sam -r"%" -f"\n" -b"\n" test6.tab x.tab
The additional scripts tab2set and hdr2set convert tables with column numbers or column names into a format appropriate for the SaM program. They are invoked in the same way as all other scripts discussed above, i.e., with
tab2set a.tab b.tab
or
hdr2set a.tab b.tab
where a.tab is the name of the input file and b.tab the name of the output file. The script tab2set replaces each table entry "x" of the input file by "Xi=x", where i is the column number (starting with 1).
The script hdr2set reads the variable names from the first line of the input file and then replaces each table entry "x" by "X=x", where "X" is the variable name that was found in the corresponding column of the first line. These scripts are handy if you want to process tabular data by treating each table row as a transaction.
Note that any input may also be read from standard input and any output may be sent to standard output, simply by specifying a '-' or an empty string "" instead of a filename. For example (on a Linux/Unix system)
cat test1.tab | sam - -
reads the transactions from standard input (where they are fed by the cat command) and writes the found frequent item sets directly to the terminal. They may be piped to any other program or script, since all other messages of the SaM program are written to standard error.
back to the top |
In addition to an input file with the transactions and an output file for the found item sets, an item selection/insertion penalties file may be passed to the SaM program with the option -R. Without the option -u, this file lists a selection of items that are to be included in the search, while any items not listed in this file are to be ignored. It can be seen as a specification of the item base for the search.
The item selection file has to be a (text) file structured by the same field and record separators and blanks as the transactions file. Items may be listed in a single record or in multiple records; it does not matter in which record an item occurs. All that matters is whether an item occurs in the item selection file (then it is included in the search) or not (then it is ignored).
If the option -u is given with a parameter ≥ 0, the file passed with the option -R is used to specify item insertion penalties. Insertion penalties are real-valued numbers in [0,1]. The insertion penalty is the higher, the smaller the value, because the insertion penalties are combined with a t-norm (to be specified with the option -N) with the transaction weight, so that transactions which only with item insertions can be made to support an item set contribute less to the support of an item set. A value of 1 means no penalty at all. A value of 0 means that the penalty is so high that an insertion is impossible.
The first record of an item insertion penalties file, which must have one field, contains the default insertion penalty as a real-valued number in [0,1]. This is used for all items that are not explicitly listed in one of the following records. The following records state the appearance of specific items, one per record. The first field of these records states the item, the second the insertion penalty as a real-valued number in [0,1].
Example 1: Allow only items a and b to be inserted, each with a penalty of 0.5.
0
a 0.5
b 0.5
Example 2: Any item may be inserted with a penalty of 0.8, except a and b, which may not be inserted, and item c, which carries the more severe insertion penalty of 0.6.
0.8
a 0
b 0
c 0.6
Providing no item insertion penalties file (if the option -u is given) is equivalent to an item insertion penalties file containing only a default insertion penalty of 0, which does not allow any item to be inserted.
back to the top |
The output format for item sets is fairly flexible. Especially the output of the additional information about item sets (support, additional evaluation) can be formatted freely.
Each line of the output file contains one item set in the format
A B C [...] <add. info>
where A, B and C are item identifiers, and <add. info> is determined by the additional information output format (option -v). The item separator may be changed with the option -k. For example,
sam -k, test1.tab test.out
produces output in the format
A,B,C,[...] <add. info>
Each output line may be preceded by a string that is specified with the option -h (record header for output).
The output format for the additional rule information can be any string with the following special format symbols (similar to the style of the special format symbols of the printf function in the programming language C):
%% | a percent sign | |
%i | number of items (item set size) | |
%a | absolute item set support | |
%s | relative item set support as a fraction | |
%S | relative item set support as a percentage | |
%w | absolute support with insertions | |
%r | relative support with insertions as a fraction | |
%R | relative support with insertions as a percentage | |
%e | additional evaluation measure | |
%E | additional evaluation measure as a percentage |
All format specifiers can be extended by an integer number between the percent sign and the defining character, which specifies the number of decimal digits to be printed (expect those that refer to integer numbers, like the number of items and the absolute support, for which such a number of digits is ignored). If no such number is given, (up to) six significant digits are reported.
The default additional information output format is for item sets " (%S)". That is, the relative support of the item set is printed with (up to) six significant digits, enclosed in parentheses. Two spaces separate the additional information from the last item in the set. The default format is automatically changed to " (%a)" if the minimum support (option -s) is given as a negative number (absolute support), so that the output is consistent with the support threshold.
A pattern spectrum collects the number of found frequent item sets sub-divided by the size and the support of the item sets. Pattern spectra (together with surrogate data/randomization methods) can be useful to determine the statistical significance of found frequent item sets. Details can be found in the papers [Picado-Muino et al. 2013] and [Torre et al. 2013].
My SaM implementation can be instructed to collect and write a pattern spectrum with the option -P, which takes the name of the file to which the patterns spectrum is to be written as a parameter.
A pattern spectrum is written as a list of (size, support, count) triplets. The output file is a simple text file that contains one triplet per line, with the three numbers separated by spaces. For example, for the input file test1.tab (see this section) the pattern spectrum is (with default option settings):
1 3 1
1 6 1
1 7 3
2 1 2
2 3 1
2 4 4
2 5 1
2 6 1
3 1 2
3 2 1
3 3 2
3 4 1
4 2 1
The first column contains the different sizes of item sets, the second column the difference support values, and the third column the number of item sets found for the corresponding (size, support) combination. For example, the sixth row indicates that there are four frequent item sets with two items and support 4. Note that in a pattern spectrum the support is always given as absolute support, that is, as a number of transactions.
back to the top |
(MIT license, or more precisely Expat License; to be found in the file mit-license.txt in the directory sam/doc in the source package of the program, see also opensource.org and wikipedia.org)
© 2008-2017 Christian Borgelt
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
back to the top |
Download page with most recent version.
back to the top |
E-mail: |
christian@borgelt.net | |
Website: |
www.borgelt.net |
back to the top |
An overview of frequent item set mining covering SaM and many other algorithms can be found in this survey paper:
A description of some implementation aspects of the SaM algorithm can be found in the following papers:
The use of pattern spectra to evaluate the statistical significance of found frequent item sets is explained in these papers:
back to the top |