Full and Naive Bayes Classifiers

(A Brief Documentation of the Programs bci / bcx / bcdb / corr)

Introduction

I am sorry that there is no detailed documentation yet. Below you can find a brief explanation of how to induce a full or naive Bayes classifier with the program bci and how to execute a Bayes classifier with the program bcx. For a list of options, call the programs without any arguments.

Enjoy,
Christian Borgelt

As a simple example for the explanations below I use the dataset in the file bayes/ex/drug.tab, which lists 12 records of patient data (sex, age, and blood pressure) together with an effective drug (effective w.r.t. some unspecified disease). The contents of this file are:

   Sex    Age Blood_pressure Drug
   male   20  normal         A
   female 73  normal         B
   female 37  high           A
   male   33  low            B
   female 48  high           A
   male   29  normal         A
   female 52  normal         B
   male   42  low            B
   male   61  normal         B
   female 30  normal         A
   female 26  low            B
   male   54  high           A

Determining Attribute Domains

To induce a Bayes classifier for the effective drug, one first has to determine the domains of the table columns using the program dom (to be found in the table package, see below):

  dom -a drug.tab drug.dom

The program dom assumes that the first line of the table file contains the column names. (This is the case for the example file drug.tab.) If you have a table file without column names, you can let the program read the column names from another file (using the -h option) or you can let the program generate default names (using the -d option), which are simply the column numbers. The -a option tells the program to determine the column data types automatically. Thus the values of the Age column are automatically recognized as integer values.

After dom has finished, the contents of the file drug.dom should look like this:

  dom(Sex) = { male, female };
  dom(Age) = ZZ;
  dom(Blood_pressure) = { normal, high, low };
  dom(Drug) = { A, B };

The special domain ZZ represents the set of integer numbers, the special domain IR (not used here) the set of real numbers. (The double Z and the I in front of the R are intended to mimic the bold face or double stroke font used in mathematics to write the set of integer or the set of real numbers. All programs that need to read a domain description also recognize a single Z or a single R.)
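
To make the idea of automatic type detection concrete, here is a minimal sketch in Python. It is not the source code of dom; the helper names (parses_as, infer_domains) are made up for this illustration, and the rule "all values parse as integers, then ZZ; all parse as reals, then IR; otherwise a symbolic domain in order of first appearance" is an assumption that merely reproduces the drug.dom shown above.

```python
# Illustrative sketch only: hypothetical helpers, not the dom implementation.
TABLE = """\
Sex    Age Blood_pressure Drug
male   20  normal         A
female 73  normal         B
female 37  high           A
male   33  low            B
female 48  high           A
male   29  normal         A
female 52  normal         B
male   42  low            B
male   61  normal         B
female 30  normal         A
female 26  low            B
male   54  high           A
"""

def parses_as(convert, value):
    """Return True if convert(value) succeeds (e.g. convert=int or float)."""
    try:
        convert(value)
        return True
    except ValueError:
        return False

def infer_domains(text):
    """Map each column name to 'ZZ', 'IR', or a list of symbolic values."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, records = rows[0], rows[1:]
    domains = {}
    for i, name in enumerate(header):
        values = [rec[i] for rec in records]
        if all(parses_as(int, v) for v in values):
            domains[name] = "ZZ"            # integer-valued column
        elif all(parses_as(float, v) for v in values):
            domains[name] = "IR"            # real-valued column
        else:
            # symbolic column: distinct values in order of first appearance
            domains[name] = list(dict.fromkeys(values))
    return domains

domains = infer_domains(TABLE)
for name, dom in domains.items():
    body = dom if isinstance(dom, str) else "{ " + ", ".join(dom) + " }"
    print(f"dom({name}) = {body};")
```

Run on the example table, the sketch prints four lines in the same form as the drug.dom file above.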

Inducing a Bayes Classifier

Induce a naive Bayes classifier with the bci program (bci is simply an abbreviation of Bayes Classifier Induction):

  bci -sa drug.dom drug.tab drug.nbc

You need not tell the program bci that the Drug column contains the class, since by default it uses the last column as the class column (the Drug column is the last column in the file drug.tab). If a different column contains the class, you can specify its name on the command line using the -c option, e.g. -c Drug.

At first glance it seems superfluous to provide the bci program with a domain description, since it is also given the table file and thus could determine the domains itself. Without a domain description, however, the bci program would be forced to use all columns in the table file and to use them with the automatically determined data types. Occasions may arise in which you want to induce a naive Bayes classifier from a subset of the columns or in which the numbers in a column are actually coded symbolic values. In such a case the domain file provides a way to tell the bci program about the columns to use and their data types. To ignore a column, simply remove the corresponding domain definition from the domain description file (or comment it out; both C-style (/* ... */) and C++-style (// ... ) comments are supported). To change the data type of a column, simply change the domain definition.

By default the program bci uses all attributes given in the domain description file. However, it can also be instructed to simplify the classifier by using only a subset of the attributes. This is done with the options -sa or -sr (s for simplify), the first of which is used in the example above. With the first option attributes are added one by one (a for add) as long as the classification result improves on the training data. With the second option, attributes are removed one by one (r for remove) as long as the classification result does not get worse.
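
The "add one by one" strategy can be sketched as a greedy forward selection. The following Python fragment is only an illustration of that general idea, not the bci implementation: the function add_attributes, the callback count_errors and the table TOY_ERRORS are hypothetical, and the error counts in the table are made up for the demonstration rather than produced by bci.

```python
# Illustrative sketch of greedy forward selection (hypothetical names).
def add_attributes(attributes, count_errors):
    """Add attributes one by one as long as the error count improves."""
    selected = []
    best = count_errors(selected)          # errors without any attribute
    remaining = list(attributes)
    while remaining:
        # evaluate every possible single addition
        trial = {a: count_errors(selected + [a]) for a in remaining}
        candidate = min(remaining, key=trial.get)
        if trial[candidate] >= best:       # no improvement: stop
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = trial[candidate]
    return selected

# Made-up error counts on the training data for each attribute subset.
TOY_ERRORS = {
    frozenset(): 6,
    frozenset({"Sex"}): 6,
    frozenset({"Age"}): 5,
    frozenset({"Blood_pressure"}): 3,
    frozenset({"Sex", "Age"}): 5,
    frozenset({"Sex", "Blood_pressure"}): 3,
    frozenset({"Age", "Blood_pressure"}): 0,
    frozenset({"Sex", "Age", "Blood_pressure"}): 0,
}

def toy_count_errors(subset):
    return TOY_ERRORS[frozenset(subset)]

chosen = add_attributes(["Sex", "Age", "Blood_pressure"], toy_count_errors)
print(chosen)
```

The removal strategy (-sr) would be the mirror image: start with all attributes and drop one at a time while the error count does not increase.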

With the above command the induced naive Bayes classifier is written to the file drug.nbc. The contents of this file should look like this:

  nbc(Drug) = {
    prob(Drug) = {
      A: 6,
      B: 6 };
    prob(Age|Drug) = {
      A: N(36.3333, 161.867) [6],
      B: N(47.8333, 310.967) [6] };
    prob(Blood_pressure|Drug) = {
      A:{ high: 3, low: 0, normal: 3 },
      B:{ high: 0, low: 3, normal: 3 }};
  };

The prior probabilities of the class attribute's values are stated first (as absolute frequencies), followed by the conditional probabilities of the descriptive attributes. For symbolic attributes a simple frequency table is stored. For numeric attributes a normal distribution is used, which is stated as N(μ, σ²) [n]. Here μ is the expected value, σ² is the variance, and n is the number of tuples these parameters were estimated from. n may differ from the number of cases for the corresponding class, since for some tuples the value of the attribute may be missing.

In this example, however, since there are no missing values, the value of n is identical to the number of cases for the corresponding class.
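
The numbers in prob(Age|Drug) can be checked by hand. The short Python sketch below recomputes them from the Age values of drug.tab, grouped by drug. It assumes that the variance is the unbiased sample variance (division by n - 1); this assumption is not stated in the program output, but it reproduces the values listed above.

```python
from statistics import mean, variance  # variance() divides by n - 1

# Age values from drug.tab, grouped by the class (Drug) column.
AGES = {
    "A": [20, 37, 48, 29, 30, 54],
    "B": [73, 33, 52, 42, 61, 26],
}

# (expected value, variance, number of tuples) per class
params = {cls: (mean(v), variance(v), len(v)) for cls, v in AGES.items()}

for cls, (mu, var, n) in params.items():
    print(f"{cls}: N({mu:.6g}, {var:.6g}) [{n}]")
```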

Executing a Bayes Classifier

An induced naive Bayes classifier can be used to classify new data using the program bcx (bcx is simply an abbreviation for Bayes Classifier eXecution):

  bcx -a drug.nbc drug.tab drug.cls

drug.tab is the table file (since we do not have special test data, we simply use the training data), drug.cls is the output file. After bcx has finished, drug.cls contains (in addition to the columns appearing in the naive Bayes classifier, and, for preclassified data, the class column) a new column bc, which contains the class that is predicted by the naive Bayes classifier. You can give this new column a different name with the -c option, e.g. -c predicted.

If the table contains preclassified data and the name of the column containing the preclassification is the same as for the training data, the error rate of the naive Bayes classifier is determined and printed to the terminal.

The contents of the file drug.cls should look like this:

  Sex    Age Blood_pressure Drug bc
  male   20  normal         A    A
  female 73  normal         B    B
  female 37  high           A    A
  male   33  low            B    B
  female 48  high           A    A
  male   29  normal         A    A
  female 52  normal         B    B
  male   42  low            B    B
  male   61  normal         B    B
  female 30  normal         A    A
  female 26  low            B    B
  male   54  high           A    A

That is, the classification is perfect, which is not surprising for such a simple example. The columns are neatly aligned because of the -a option. Without it, there would only be a single space between two column values.
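
How such a classifier arrives at its prediction can be illustrated with a few lines of Python. This is a sketch of the textbook naive Bayes rule (prior times the product of the conditional probabilities, highest score wins), fed with the parameters from drug.nbc. It is not the bcx source code; in particular it uses plain relative frequencies, and whether bcx applies any correction to zero frequencies is not covered here.

```python
import math

# Parameters copied from the file drug.nbc.
PRIOR = {"A": 6, "B": 6}
AGE = {"A": (36.3333, 161.867), "B": (47.8333, 310.967)}   # (mean, variance)
PRESSURE = {"A": {"high": 3, "low": 0, "normal": 3},
            "B": {"high": 0, "low": 3, "normal": 3}}

def normal_density(x, mu, var):
    """Density of the normal distribution N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def classify(age, pressure):
    """Return the class with the highest (unnormalized) posterior."""
    total = sum(PRIOR.values())
    scores = {}
    for cls, count in PRIOR.items():
        p = count / total                                   # prior
        p *= normal_density(age, *AGE[cls])                 # P(Age | class)
        p *= PRESSURE[cls][pressure] / sum(PRESSURE[cls].values())
        scores[cls] = p
    return max(scores, key=scores.get)

# (Age, Blood_pressure, Drug) for the 12 records of drug.tab.
RECORDS = [(20, "normal", "A"), (73, "normal", "B"), (37, "high", "A"),
           (33, "low", "B"), (48, "high", "A"), (29, "normal", "A"),
           (52, "normal", "B"), (42, "low", "B"), (61, "normal", "B"),
           (30, "normal", "A"), (26, "low", "B"), (54, "high", "A")]

predicted = [classify(age, bp) for age, bp, _ in RECORDS]
print(predicted)
```

In this sketch the blood pressure alone decides the cases with high or low pressure (the other class has frequency 0), while for normal pressure the decision rests on the two normal distributions of the age.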

Computing a Confusion Matrix

The classification quality can be inspected in more detail with the program xmat (determine a confusion matrix, to be found in the table package, see below):

  xmat drug.cls

This program reads the output of the program bcx and computes a confusion matrix from two columns of this file. It uses the last two columns by default (the last column for the x-direction and the second to last for the y-direction), which is fine for our example. Other columns can be selected via the options -x and -y followed by the name of the column that is to be used for the x- or y-direction of the confusion matrix. The output of the program xmat, which by default is written to the terminal, should read like this:

  confusion matrix for Drug vs. bc:
   no | value  |      1      2 | errors
  ----+--------+---------------+-------
    1 | A      |      6      0 |      0
    2 | B      |      0      6 |      0
  ----+--------+---------------+-------
      | errors |      0      0 |      0

In this matrix the x-direction corresponds to the column bc and the y-direction to the column Drug. Since in our simple example the classification is perfect, only the fields in the diagonal differ from zero. If the classification is not perfect, the other fields show what errors are made by the naive Bayes classifier.
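
For readers who want to see how such a matrix is counted, here is a small Python sketch. It is not the xmat program; the function name confusion_matrix is chosen for this illustration only. The two lists hold the Drug and bc columns of drug.cls.

```python
from collections import Counter

# Columns Drug (y-direction) and bc (x-direction) from drug.cls.
ACTUAL = list("ABABAABBBABA")
PREDICTED = list("ABABAABBBABA")

def confusion_matrix(actual, predicted):
    """Rows correspond to the actual class, columns to the predicted class."""
    classes = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(y, x)] for x in classes] for y in classes]
    return classes, matrix

classes, matrix = confusion_matrix(ACTUAL, PREDICTED)
# every off-diagonal entry is a misclassification
errors = sum(matrix[i][j]
             for i in range(len(classes))
             for j in range(len(classes)) if i != j)

for cls, row in zip(classes, matrix):
    print(cls, row)
print("errors:", errors)
```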

Generating a Database

The program bcdb can be used to generate a database of sample cases from a full or naive Bayes classifier. For example, invoking it with

  bcdb test.fbc test.tab

generates a database with 1000 tuples from the full Bayes classifier test.fbc that can be found in the directory ex. The number of tuples to be generated can be changed with the option -n#, where # is to be replaced by the desired number. For other options call the program without any arguments.
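
The principle of generating sample cases from a classifier can be sketched in Python as follows. Since the contents of test.fbc are not shown here, the sketch uses the parameters of the naive Bayes classifier drug.nbc as a stand-in. It is an illustration under these assumptions and not the bcdb implementation: the function name sample, the fixed seed and the rounding of the age to an integer are choices made for this example.

```python
import math
import random

# Parameters copied from drug.nbc (stand-in for a classifier file).
PRIOR = {"A": 6, "B": 6}
AGE = {"A": (36.3333, 161.867), "B": (47.8333, 310.967)}   # (mean, variance)
PRESSURE = {"A": {"high": 3, "low": 0, "normal": 3},
            "B": {"high": 0, "low": 3, "normal": 3}}

def sample(n, seed=0):
    """Draw n tuples (Age, Blood_pressure, Drug) from the classifier."""
    rng = random.Random(seed)
    classes = list(PRIOR)
    rows = []
    for _ in range(n):
        # 1. draw the class according to the prior frequencies
        cls = rng.choices(classes, weights=[PRIOR[c] for c in classes])[0]
        # 2. draw each attribute from its distribution given the class
        mu, var = AGE[cls]
        age = round(rng.gauss(mu, math.sqrt(var)))   # gauss() expects std. dev.
        values = list(PRESSURE[cls])
        bp = rng.choices(values,
                         weights=[PRESSURE[cls][v] for v in values])[0]
        rows.append((age, bp, cls))
    return rows

rows = sample(1000)
print(rows[:3])
```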

Computing Covariances and Correlation Coefficients

The program corr can be used to compute covariances and correlation coefficients. By invoking it with

  corr -xvc iris.tab

the expected values and standard deviations (option -x), the covariances (option -v) and the correlation coefficients (option -c) for the four numeric attributes of the well-known iris data are computed. The output should look like this:

   no | attribute    | exp. val. | std. dev.
   ---+--------------+-----------+----------
    1 | sepal_length |  5.843333 |  0.825301
    2 | sepal_width  |  3.057333 |  0.434411
    3 | petal_length |  3.758000 |  1.759404
    4 | petal_width  |  1.199333 |  0.759693

   covariance matrix
   no | attribute    |        1        2        3        4
   ---+--------------+------------------------------------
    1 | sepal_length |  0.68112 -0.04215  1.26582  0.51283
    2 | sepal_width  |           0.18871 -0.32746 -0.12083
    3 | petal_length |                    3.09550  1.28697
    4 | petal_width  |                             0.57713

   correlation coefficients
   no | attribute    |     1     2     3     4
   ---+--------------+------------------------
    1 | sepal_length |  1.00 -.118  .872  .818
    2 | sepal_width  |        1.00 -.428 -.366
    3 | petal_length |              1.00  .963
    4 | petal_width  |                    1.00
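
The three tables are linked by the usual relation: a correlation coefficient is the covariance divided by the product of the two standard deviations. The Python sketch below checks this using only the numbers printed above (it does not read iris.tab and is not part of the corr program); the variable names are chosen for this illustration.

```python
# Standard deviations and covariances copied from the output above
# (attribute order: sepal_length, sepal_width, petal_length, petal_width).
STD = [0.825301, 0.434411, 1.759404, 0.759693]
COV = {
    (0, 0): 0.68112, (0, 1): -0.04215, (0, 2): 1.26582, (0, 3): 0.51283,
    (1, 1): 0.18871, (1, 2): -0.32746, (1, 3): -0.12083,
    (2, 2): 3.09550, (2, 3): 1.28697,
    (3, 3): 0.57713,
}

# correlation coefficient = covariance / (std. dev. i * std. dev. j)
corr = {(i, j): c / (STD[i] * STD[j]) for (i, j), c in COV.items()}

for (i, j), r in corr.items():
    print(f"r({i + 1},{j + 1}) = {r:.3f}")
```

Rounded to three decimal places, the computed values agree with the table of correlation coefficients. The diagonal of the covariance matrix likewise holds the squared standard deviations.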

Copying

(MIT license, or more precisely Expat License; to be found in the file mit-license.txt in the directory bayes/doc in the source package of the program, see also opensource.org and wikipedia.org)

© 1999-2016 Christian Borgelt

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Download

Download page with most recent version.

Contact

E-mail:    christian@borgelt.net
Website: www.borgelt.net