CoCoNAD+PSF+PSR

Continuous Time Closed Neuron Assembly Detection
with Pattern Spectrum Filtering and Pattern Set Reduction

Introduction

CoCoNAD (for Continuous Time Closed Neuron Assembly Detection) is an algorithm for finding frequent parallel episodes or frequent (approximately) synchronous events in parallel point processes, for example, in parallel spike trains as they are recorded with multi-electrode devices in neurobiology. The script/program ccn+psf+psr.py, which is written in Python, provides a command line interface to CoCoNAD and accesses a C-based implementation of the CoCoNAD algorithm through a Python extension module. It also comprises the methods of Pattern Spectrum Filtering (PSF) and Pattern Set Reduction (PSR), with which the abundance of frequent patterns that are usually found in a given data set can be reduced to a smaller set of significant patterns.

If a graphical user interface to the CoCoNAD method as well as to pattern spectrum filtering (PSF) and pattern set reduction (PSR) is desired instead of a command line interface (as ccn+psf+psr.py provides), the Java program CoCoGUI, which is available on its own download page, is worth looking at. This user interface accesses the same C implementation of the CoCoNAD algorithm through a library that employs the Java Native Interface (JNI) to make C-based functions available in Java. It also comprises the methods of pattern spectrum filtering (PSF) and pattern set reduction (PSR) as well as viewers for parallel (spike) train data, pattern spectra and pattern sets.

General information about the theory underlying the CoCoNAD algorithm can be found in [Picado-Muiño and Borgelt 2013], while implementation aspects are discussed in [Borgelt and Picado-Muiño 2013]. The method of pattern spectrum filtering was introduced in [Picado-Muiño et al. 2013] (although for time-binned data, while CoCoNAD works on continuous data, which requires some adaptations) and was discussed in more detail as well as extended by methods for pattern set reduction in [Torre et al. 2013] (again for time-binned data, but the core ideas can be transferred to the continuous time operation of CoCoNAD). It is recommended to study these papers before applying the script/program ccn+psf+psr.py to your own data. The following is only a very brief summary/review of some fundamental principles.

The core idea of CoCoNAD is to define a set of (approximately) synchronous events or a parallel episode in parallel (spike) trains or point processes as a group of events (action potentials or spikes in the neurobiology application) that occur with a user-specified maximum distance (in time) from each other. CoCoNAD then looks for frequent synchronous events or frequent parallel episodes, where the frequency or the support of a group of items (event types, or neurons in the neurobiological application) is measured by a maximum independent set approach: each set of items/events (spikes) of the same types (or: each parallel episode for the same items/event types, each set of approximately synchronous spikes from the same set of neurons) is an element of a family of sets. The size of a maximum independent set of this family, that is, of a selection of sets that have only empty pairwise intersections (independent set) and which comprises as many sets as possible (maximum independent set), is the support of the set of items/neurons. From a neurobiological point of view, this is a very plausible and intuitive support measure, since it ensures that no single event (spike) is counted more than once for the support of a given set of items/neurons. From a computer science/data analysis point of view, this support measure has the advantage that it is anti-monotone (or downward closed) and thus allows for effective and efficient pruning of the search for frequent synchronous events/parallel episodes.
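
To make the support measure concrete, the following is a minimal Python sketch of a greedy scheme that counts disjoint synchronous groups for a given set of items. The data layout (a dictionary mapping each item to its sorted list of event times) and all names are illustrative; the actual C implementation is considerably more refined:

def support(trains, items, w):
    # Greedily count disjoint groups of events, one per item, that lie
    # no farther apart than w (illustrative sketch of the MIS support).
    idx = {i: 0 for i in items}                  # one cursor per train
    count = 0
    while all(idx[i] < len(trains[i]) for i in items):
        times = {i: trains[i][idx[i]] for i in items}
        if max(times.values()) - min(times.values()) <= w:
            for i in items:                      # group found:
                idx[i] += 1                      # consume its events
            count += 1
        else:                                    # earliest event cannot be
            first = min(items, key=times.get)    # part of any group anymore,
            idx[first] += 1                      # so skip it
    return count

Because each event is consumed at most once, no event contributes to more than one counted occurrence, which is exactly the maximum independent set idea described above.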

The core idea underlying pattern spectrum filtering (PSF) is that among the abundance of frequent patterns found with CoCoNAD most are chance occurrences. That is, the majority of these patterns are such that they are likely to be observed even if there is no regularity in the (co-)occurrence of the events (that is, if the spike trains are independent or if events of different types occur independently). To filter out these chance patterns and thus reduce the patterns to the (likely) significant ones, surrogate data sets are generated by randomizing and permuting the original data, with the aim of destroying any (regular) synchronous activity, but keeping as many other properties of the data as possible (like the number of points per train/spikes per neuron, local event/spike densities etc). These surrogate data sets implicitly represent the null hypothesis of independent point processes/(spike) trains. By generating and analyzing a sufficiently large number of surrogate data sets, a pattern spectrum is generated. A pattern spectrum maps pattern signatures, that is, pairs of the size of a pattern (its number of items/neurons) and its support (number of (co-)occurrences in the maximum independent set sense), to occurrence counts, that is, to the (average) number of patterns with such a signature that have been observed in the surrogate data sets. It is then assumed that any patterns found in the original data that possess signatures that occur also in at least one surrogate data set can safely be considered chance events (since a counterpart — same signature — occurred in presumably independent data) and thus can be removed from the patterns found in the original data.
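
In outline, generating a pattern spectrum from surrogates amounts to the following Python sketch, where surrogate() and mine() are stand-ins for an actual surrogate generator and an actual CoCoNAD mining call (all names are illustrative):

from collections import Counter

def pattern_spectrum(trains, surrogate, mine, n=1000):
    # Average the signature counts over n surrogate data sets.
    spec = Counter()
    for _ in range(n):
        for items, supp in mine(surrogate(trains)):
            spec[len(items), supp] += 1          # count this signature
    return {sig: cnt / n for sig, cnt in spec.items()}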

As a faster alternative to generating a pattern spectrum by analyzing a large number of surrogate data sets, an estimation method is provided, which evaluates the characteristics of the data to produce an approximation of a pattern spectrum. Although this method yields only an approximation of an actual pattern spectrum, it has the advantage that it produces a pattern spectrum orders of magnitude faster and thus is the method of choice for larger data sets (more event types/neurons, longer recording periods) and for larger maximally allowed distances between events that are to be considered synchronous. Due to this speed advantage, estimating a pattern spectrum has been made the default operation mode.

The patterns that remain after pattern spectrum filtering are then further reduced by analyzing subset and superset relationships between them, since chance coincidences of subsets of found patterns and chance coincidences of the whole set with items/neurons that are not part of an actual pattern can lead to smaller patterns with higher support and larger patterns with lower support, respectively, which are not significant themselves, but only induced by an actually significant pattern. The core principle of the pattern set reduction (PSR) step is to define a preference relation between patterns of which one is a subset of the other and then to keep only patterns to which no other pattern is preferred.

Program Invocation

The Python script/program ccn+psf+psr.py is distributed together with several other (supporting) scripts in the archives psf+psr.zip or psf+psr.tar.gz (same contents, choose according to your preference) that are available on the download page for the PyCoCo library. One of these archives needs to be unpacked to some directory to get access to the main script/program ccn+psf+psr.py and its supporting scripts.

It is highly recommended to install the PyCoCo library as well (a Python extension module that makes a fast C-based implementation of the CoCoNAD algorithm as well as an efficient parallelized version of the pattern spectrum generation available in Python). The Python scripts also work without this library, but at a high price in execution time: the pure Python fallback can be slower by a factor of 40 or even more in certain cases.

Apart from the main script/program ccn+psf+psr.py, the archives contain several supporting scripts, some of which are always needed, while others are only fallen back on if the PyCoCo library is not installed.

The general syntax of the program invocation is

[python] ccn+psf+psr.py [options] infile [outfile]

Here [python] stands for the name of the Python interpreter that is to execute the script/program ccn+psf+psr.py. The script/program should work with both Python 2.7.x and Python 3.x. (Note, however, that the PyCoCo library needs to be compiled specifically for the Python version to be used. On the download page for the PyCoCo library, precompiled versions are available only for Python 2.7.x.)

On a GNU/Linux system, the script/program ccn+psf+psr.py may also be called directly (provided the executable flag of the file is set, otherwise execute the command "chmod +x ccn+psf+psr.py"), because the first line of this script/program indicates that /usr/bin/python should be used as the interpreter. This is the reason why the Python interpreter is indicated as optional (by the enclosing brackets). In the following, example command lines will be given in this format, that is, without an explicit statement of the Python interpreter. However, if the script is called directly, it may have to be specified as ./ccn+psf+psr.py, unless the current directory is on your PATH environment variable. (Note that this is usually not the case by default, due to computer security concerns.)

On Microsoft Windows, however, it is mandatory to specify this interpreter, since on Microsoft Windows there is no way to specify the Python interpreter in the script itself (I may be mistaken about this; if you know how to do it, please let me know). Whether the interpreter can be stated merely as python or whether a full path to the Python executable has to be given depends on whether the directory in which the Python interpreter resides is on the PATH environment variable or not. As this is system specific, no general statement can be made.

The first argument infile, which is mandatory, is the name of a file that contains the point processes/spike trains to analyze. The format of this file is described in the section (Spike) Trains File below.

The second argument outfile, which is optional (as indicated by the brackets), is the name of a file to which the found patterns (after pattern spectrum filtering and pattern set reduction) are to be written. Omitting it may be useful for benchmark tests, although after pattern spectrum filtering and pattern set reduction usually only a few patterns remain, so that the time it takes to write the output to a file is likely too small to conceal the actual search time. Omitting it may also be useful if only a pattern spectrum (that is, a mapping from pattern signatures to occurrence counts; option -P#) is to be determined. The format in which found patterns are written to the output file is described in the section Pattern Output File below.

In addition to the input and output file(s) several options can be given, all of which consist of a minus sign and a single letter. The full list of options can be found in the next section.

Some options take a parameter. For example, the option -s#, with which the minimum support is specified, takes a number as a parameter, which must follow the letter s without any separating space. The option -P#, which indicates that a pattern spectrum file is to be written, requires the name of the file as an argument. As with the option -s#, this file name must follow the option character P directly, that is, without a separating space. The format of the pattern spectrum file is described in the section Pattern Spectrum File below.

Options may be specified anywhere on the command line, that is, before the input file name, in between the input and output file names, or after the output file name.

If an option is given more than once, the last statement counts. That is, with

ccn+psf+psr.py -s2 -s3 input output

the minimum support is 3, as the -s2 is overwritten by the following -s3.

To test the operation of the script/program ccn+psf+psr.py, it is recommended to retrieve the example file trains.txt from the download page. If this file is placed into the same directory from which the script/program ccn+psf+psr.py was started, it can be used directly as the (spike) trains input data file. For example,

ccn+psf+psr.py trains.txt -Pspectrum.txt patterns.txt

processes this file with the default options. This should take between 20 and 60 seconds on a reasonably recent computer and should finally write three patterns to the output file patterns.txt. At the same time, the pattern spectrum that has been determined from the surrogate data sets is written to the file spectrum.txt.

Program Options

The script ccn+psf+psr.py supports the following options (a # after the option letter means that the option takes a parameter; option letters are case sensitive; the default of each option is given in parentheses):

-T#  target pattern type (default: i)
       i: item sets (item order is ignored)
       c: partial permutations (without repetition)
       m: sequences (with & without repetition)
-t#  target pattern subtype for original data (default: c)
       f/a/s: all frequent patterns
       c: closed (frequent) patterns
       m: maximal (frequent) patterns
-u#  target pattern subtype for surrogate data (default: f)
       f/a/s: all frequent patterns
       c: closed (frequent) patterns
       m: maximal (frequent) patterns
-w#  width of time window/maximum distance (default: 0.003)
-s#  minimum support of an item set (default: 2)
-m#  minimum number of items per item set (default: 1)
-n#  maximum number of items per item set (default: no limit)
-a#  start of point/time range (default: auto)
-z#  end of point/time range (default: auto)
-x   do not prune with perfect extensions (default: prune)
-g#  surrogate generation method (default: e)
       x: none (read pattern spectrum from file)
       i: identity (keep original data)
       r: point/spike time randomization
       d: point/spike dithering/displacement
       s: train shifting/dithering
       k: sampling from a kernel estimate
       p: dithered item permutation
       b: dithered blocked item permutation
       e: estimate pattern spectrum (no surrogates)
-c#  number of surrogate data sets (default: 1,000)
       (if ≤ 0, the pattern spectrum is read from a file)
-d#  random function type for displacements (default: r)
       u/r: uniform/rectangular density
       t: symmetric triangular density
       g/n: Gaussian/normal density
-p#  parameter for random function density (default: 0.005)
       (standard deviation or half base width of rectangle/triangle)
-B#  block size for blocked permutations (default: 0.03)
       (only used with option -gb)
-Z#  number of cpus/processor cores to use (default: 0)
       (a value ≤ 0 means all cpus reported as available)
-e#  item probability dispersion factor (default: 0.5)
       (only used with option -ge)
-i#  number of samples to be drawn per item set size (default: 1,000)
       (only used with option -ge)
-R#  pattern set reduction method (default: L)
       x: none (keep all patterns after filtering)
       c: excess coincidences (zb, cb-ca)
       C: excess coincidences (zb, cb-ca+1)
       i: excess items/neurons (za-zb+2, ca)
       s: covered points/spikes za*ca : zb*cb
       S: covered points/spikes (za-1)*ca : (zb-1)*cb
       l: combined lenient (C+i+s, break only rejection tie)
       L: combined lenient (C+i+S, break only rejection tie)
       t: combined strict (C+i+s, always force decision)
       T: combined strict (C+i+S, always force decision)
-S#  seed for pseudo-random numbers (default: 0, i.e. current time)
-P#  name of pattern spectrum file (default: none)
-q#  column separator for pattern spectrum (default: " ")
-h#  record header for output (default: "")
-k#  item separator for output (default: " ")
-v#  output format for pattern support (default: " (%d)")
-l#  one train per record (item and points) (default: id/time pairs)
-y#  no item (with -l) or item after point (default: item first)
-r#  record separators (default: "\n")
-f#  field separators (default: " \t,")
-b#  blank characters (default: " \t\r")
-C#  comment characters (default: "#")

This list of options is also printed if the script/program ccn+psf+psr.py is called without arguments.

Note that in the arguments to the options -r, -f, -b and -C ASCII escape sequences (like "\n" or "\t") are recognized.

With the options -a# and -z# a (spike) time range may be specified, with which the input data is filtered. That is, only points/spikes with occurrence times in the specified time range are used in the analysis. The spike time range is also used to clamp the points/spikes in surrogate data sets that are generated to obtain a pattern spectrum. That is, no spikes/event times will be generated that lie outside of this time range. Technically, this is achieved by wrapping round spike/event times before the start of the range to its end and spike/event times after the end of the range to its start.
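
In Python terms, this wrap-around clamping can be sketched as follows (illustrative; a and z denote the start and the end of the time range):

def wrap(t, a, z):
    # Map a (possibly displaced) time back into the range [a, z);
    # Python's % handles negative offsets, so times before the start
    # of the range wrap to its end and vice versa.
    return a + (t - a) % (z - a)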

If the options -a# and -z# are not given, the spike time range is determined automatically from the data: the lower bound of the range is the earliest point/event/spike time rounded to the next lower integer (unless an integer itself), and the upper bound of the range is the latest point/event/spike time rounded to the next higher integer (unless an integer itself).

A description of the surrogate data generation methods (option -g#) can be found in the section Pattern Spectrum Filtering below. Note that if the option -gx is given, that is, if no surrogate data generation method is selected, it is mandatory to provide the option -P# as well. This option then specifies the file from which a pattern spectrum is to be read. The format of a pattern spectrum file is described in the section Pattern Spectrum File below.

Pattern Spectrum Filtering

The PSF and PSR parameters start with the number of surrogate data sets to generate. According to [Picado-Muiño et al. 2013] this number should be computed as an estimate of the number of pattern signatures that occur in the original data and that one is actually willing to consider as significant, divided by the desired significance level for the patterns (thus implementing a simple Bonferroni correction for multiple testing). For example, if there are 30 pattern signatures that one is actually willing to consider as possible patterns and the desired significance level is 1% = 0.01, then 30/0.01 = 3,000 surrogate data sets should be generated. However, for most practical purposes, 1,000 data sets are usually sufficient, since the decision border does not change much with the number of surrogate data sets, provided that number is not too small.

For generating surrogate data sets, six different methods are available, plus two special options for reading a pattern spectrum from a file ("none (read pattern spectrum from file)", option -gx, where the file name has to be specified with the option -P#) and for keeping the original data unchanged ("identity", option -gi). Since generating multiple surrogate data sets with an identity mapping is obviously useless (the exact same set of patterns would be generated every time), the number of surrogate data sets is ignored for this option, and only one surrogate data set is generated (which, due to the identity mapping, is identical to the original data). This option mainly serves to allow a user to generate and store the pattern spectrum of the original data.

Most of the actual surrogate data generation methods require a probability density function, from which random numbers are sampled, and a parameter that governs the width of this density function ("sigma parameter", option -p#). The density function can be "rectangular" (or uniform; option -dr or -du), "triangular" (option -dt) or "Gaussian/normal" (option -dg or -dn). For the first two choices the sigma parameter states half the base width of the rectangular or triangular density function, in the same (time) units that are used in the (spike) train data file and in which the time window width is specified with the option -w# as well. For a Gaussian/normal density function, the sigma parameter states the standard deviation (which explains the name "sigma parameter", since in statistics the standard deviation is usually denoted by σ — the Greek character sigma).
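
In Python terms, drawing a single displacement according to the chosen density could be sketched like this (illustrative; sigma is the value given with the option -p#):

import random

def displacement(kind, sigma):
    # Draw one random offset according to the chosen density function.
    if kind in ('u', 'r'):                        # uniform/rectangular
        return random.uniform(-sigma, sigma)
    if kind == 't':                               # symmetric triangular
        return random.triangular(-sigma, sigma, 0.0)
    if kind in ('g', 'n'):                        # Gaussian/normal
        return random.gauss(0.0, sigma)
    raise ValueError("unknown density type " + kind)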

The surrogate data generation methods are:
  - point/spike time randomization (option -gr)
  - point/spike dithering/displacement (option -gd)
  - train shifting/dithering (option -gs)
  - sampling from a kernel estimate (option -gk)
  - dithered item permutation (option -gp)
  - dithered blocked item permutation (option -gb)

Note that the density function type and the sigma parameter are ignored for the first surrogate data generation method (point/spike time randomization), because this method always samples from a uniform distribution on the (time) range of the points/spikes.

Note also that the last three methods can, in principle, be applied with a sigma parameter of zero (that is, the chosen points/spikes are not displaced/dithered), while the second and third method (dithering and train shifting) require a non-zero sigma parameter to actually modify the data. The reason is that the last three methods not only displace/dither points/spikes within a train, but also permute points/spikes between trains. This has the advantage that possibly present synchronous activity is destroyed more thoroughly. However, the fourth and the fifth method have the disadvantage that the rate profiles of individual trains may also be changed (only the overall rate profile — the sum over all trains — is preserved). Only the last method (dithered blocked item permutation) preserves the rate profiles of individual trains (up to the precision of the block size).
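
For instance, a point/spike dithering surrogate (option -gd) with a Gaussian density could be sketched like this (illustrative only; trains is a dictionary of sorted spike time lists, and a and z denote the time range):

import random

def dither_surrogate(trains, a, z, sigma):
    # Displace every spike independently by a Gaussian offset and wrap
    # displaced times back into the recording range [a, z).
    return {item: sorted(a + (t + random.gauss(0.0, sigma) - a) % (z - a)
                         for t in times)
            for item, times in trains.items()}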

The execution of all surrogate generation methods is parallelized internally. That is, as many surrogate data sets are generated and analyzed in parallel as the computer on which the script/program ccn+psf+psr.py is running reports as available processor cores. This makes the pattern spectrum generation much faster, but can render the computer unresponsive and slow as long as the analysis is running, since all available processing power of the machine is utilized. To avoid this, the number of cpus/processor cores to be used can be set explicitly with the option -Z#. If a value ≤ 0 is given with this option, all cpus/processor cores are used that are reported as available by the system.
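
A minimal sketch of this scheme with Python's standard library (the function analyze is a stand-in for generating and mining one surrogate data set):

from multiprocessing import Pool

def analyze(seed):
    # Stand-in: generate one surrogate data set from `seed`, mine it,
    # and return the pattern signatures found in it.
    return []

if __name__ == '__main__':
    with Pool() as pool:                           # Pool(n) limits the cores
        results = pool.map(analyze, range(1000))   # used, like option -Z#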

In order to make the surrogate data generation reproducible, a seed for the pseudo-random number generator can be specified. An empty seed or a seed value of zero means that the current time (as a Unix time stamp) is used as a seed (which renders the surrogate data generation essentially irreproducible). Note, however, that reproducible results can only be obtained on computers that offer the same number of processor cores. A different number of processor cores usually means that a different set of surrogate data sets is generated, due to the different split into parallel execution threads.

As an alternative to generating and analyzing surrogate data sets, a pattern spectrum may be obtained by estimating it from the characteristics of the (spike) trains:

The pattern spectrum estimation assesses the expected counts for the pattern signatures for each pattern size separately. For a given pattern size, the number of "slots" in the data is counted, that is, the number of event groups of the given size that fall within the specified window width/are no farther apart than this maximum distance, and that could therefore hold a pattern of this size. Then a distribution over the coincidence/support values is estimated as an average of Poisson distributions, where the distribution parameter is computed from the number of slots and estimates of the probabilities of specific item/event type sets. These probabilities are computed from the occurrence probabilities of the individual items/event types.

However, it turns out that using the item/event type probabilities as they can be derived from the data leads to an overestimate of the expected number of coincidences. As a heuristic correction, the dispersion of the item probabilities is multiplied by a factor (less than 1) in order to make the probabilities more similar than they actually are. Empirically it was found that a factor around 0.4 to 0.5 leads to good results. This factor can be specified with the option -e#.

Furthermore, since different items/events occur with different probabilities and it is (due to a combinatorial explosion) impossible to enumerate all possibilities, samples of concrete item/event type sets are drawn. How many samples are to be drawn per item set size (and thus how many Poisson distributions are to be averaged for an estimate of the distribution over the coincidence/support values) can be specified with the option -i#. Since the sample drawing is also a random process, which one may want to make reproducible, a seed for the random number generator may be specified (see the explanations given above for the surrogate data generation). Note that the number of surrogates is also used for the pattern spectrum estimation, namely as an equivalent number that determines how the occurrence counts for the signatures are rounded and thresholded.
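
The following is only a rough illustration of this averaging idea, not the actual estimation procedure; slots and item_probs stand for quantities derived from the data as described above, and all names are illustrative:

import math, random

def estimate_distribution(slots, item_probs, size, nsamples=1000, cmax=20):
    # Average Poisson distributions over sampled item sets of the given
    # size to estimate the distribution over coincidence/support values.
    items = list(item_probs)
    dist = [0.0] * (cmax + 1)
    for _ in range(nsamples):
        chosen = random.sample(items, size)
        lam = slots * math.prod(item_probs[i] for i in chosen)
        for c in range(cmax + 1):            # add this set's Poisson term
            dist[c] += math.exp(-lam) * lam ** c / math.factorial(c)
    return [d / nsamples for d in dist]      # average over the samples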

After a pattern spectrum has been obtained by analyzing surrogate data sets, it may be written to the pattern spectrum file that is specified with the option -P# (if this option has been given), and the patterns found in the original data are filtered with it. That is, only patterns with signatures that do not occur in the surrogate data sets/in the (estimated) pattern spectrum are kept, where a pattern signature is a pair of pattern size (number of items/neurons) and pattern support (occurrence frequency/number of coincidences in a maximum independent set sense). All other patterns are discarded.
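
In code, this filtering step amounts to no more than a signature lookup (a sketch; patterns are assumed to be (item set, support) pairs, and the spectrum a dictionary mapping (size, support) signatures to counts):

def pattern_spectrum_filter(patterns, spectrum):
    # Keep only patterns whose signature never occurred in the surrogates.
    return [(items, supp) for items, supp in patterns
            if (len(items), supp) not in spectrum]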

Pattern Set Reduction

The patterns remaining after pattern spectrum filtering (PSF) are then further reduced by pattern set reduction (PSR), which analyzes subset and superset relationships between the found patterns. The reason for this step is that due to chance coincidences of subsets of items/neurons of an actually significant pattern or a few chance coincidences with additional items/neurons outside of the actually significant patterns, the filtered pattern set may still contain non-significant induced patterns. These patterns are removed based on the following rationale: between pairs of patterns, one of which is a subset of the other, a (possibly incomplete) preference relation is defined. Based on this preference relation, only those patterns are kept to which no other pattern (subset or superset pattern) is preferred.

In the following description of the pattern set reduction methods (or rather the preference relations underlying them), A and B refer to two patterns (sets of items/neurons) with B ⊆ A (that is, B is a subset — or subpattern — of A), za and zb refer to the sizes of A and B (that is, to the number of items/neurons contained in them), and ca and cb refer to their supports (that is, to the number of coincidences/synchronous events/parallel episodes in a maximum independent set sense exhibited by each pattern).

The pattern set reduction methods/preference relations are:
  - none (option -Rx): keep all patterns left after pattern spectrum filtering.
  - excess coincidences (options -Rc and -RC): the subset B is preferred if its excess coincidences over the superset A are significant, that is, if the signature (zb, cb-ca) (for -Rc) or (zb, cb-ca+1) (for -RC) does not occur in the pattern spectrum; otherwise the superset A is preferred.
  - excess items/neurons (option -Ri): the superset A is preferred if its excess items over the subset B are significant, that is, if the signature (za-zb+2, ca) does not occur in the pattern spectrum; otherwise the subset B is preferred.
  - covered points/spikes (options -Rs and -RS): the pattern that covers more points/spikes is preferred, that is, A is preferred if za*ca > zb*cb (for -Rs) or (za-1)*ca > (zb-1)*cb (for -RS), and B otherwise.
  - combined lenient (options -Rl and -RL) and combined strict (options -Rt and -RT): combine the excess coincidences, excess items and covered points/spikes criteria (C+i+s for -Rl and -Rt, C+i+S for -RL and -RT); the lenient variants use the combination only to break a rejection tie, while the strict variants always force a decision.
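
The final selection can then be pictured as the following Python sketch, where preferred(p, q) stands for whichever of the above preference relations was chosen, and patterns are given as (item set, support) pairs with item sets as frozensets (all names are illustrative):

def pattern_set_reduction(patterns, preferred):
    # Keep only patterns to which no sub- or superset pattern is preferred.
    keep = []
    for a, ca in patterns:
        rivals = [(b, cb) for b, cb in patterns
                  if b != a and (b < a or a < b)]    # proper sub/supersets
        if not any(preferred(r, (a, ca)) for r in rivals):
            keep.append((a, ca))
    return keep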

After the patterns have been reduced with pattern set reduction (PSR), the final set of patterns is written to the pattern output file that is specified as the (optional) second argument on the command line.

Data Format

The following sections describe the data format of the three file types that the script ccn+psf+psr.py reads or writes: the (spike) trains file, the pattern spectrum file and the pattern output file.

(Spike) Trains File

The (spike) trains file has to be a (text) file structured by field and record separators and blanks. Record separators, not surprisingly, separate records, usually lines (since the default record separator is the newline character). Field separators separate fields, usually words (since among the default field separators are the space and the tabulator character, but also the comma). Blanks are used to fill fields, for example, to align them. In addition, comment characters are recognized: if a record starts with a character that is among those declared as comment characters, the record is considered to be a comment and is therefore ignored. Record separators are specified with the option -r, field separators with -f, blank characters with -b and comment characters with -C. In the strings passed as arguments to these options standard ASCII escape sequences, like "\n" or "\t", are recognized. Furthermore, special characters can be specified as "\000" or "\x00", where the zeros have to be replaced with the octal or hexadecimal code of the character, respectively.

There are four different record formats, which can be selected with the two options -l and -y (an example input in the default format is shown below):
  - default (neither -l nor -y): each record contains an item (event type/neuron identifier) followed by a point (event/spike time), that is, one event per record;
  - with -y: each record contains a point followed by an item, that is, the item comes after the point;
  - with -l: each record contains one whole train, that is, an item followed by all points of that train;
  - with -l and -y: each record contains only the points of one train, without an item; the trains are then distinguished simply by their position in the file.
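
For example, in the default format an input file might begin like this (hypothetical data; the first field is the item/neuron identifier, the second the event/spike time):

1 0.0273
2 0.0281
1 0.0352
3 0.0360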

Note that in all of the above formats, commas or tabulator characters may also be used as the field separator without having to change any settings, since the comma and the tabulator character are also among the default field separators.

Pattern Spectrum File

A pattern spectrum maps pattern signatures, that is, pairs of the size of a pattern (number of items/neurons) and its support (number of (co-)occurrences in the maximum independent set sense), to occurrence counts, that is, to the (average) number of patterns with such a signature that have been observed in the surrogate data sets.

The name of the pattern spectrum file can be specified with the option -P (by default, no pattern spectrum file is written). If a pattern spectrum file is written, it contains three columns: the first column contains the pattern size, the second column the pattern support and the third column the (average) occurrence count. An example pattern spectrum file starts like this:
2 2 27.35
2 3 82.748
2 4 188.141
2 5 338.811
2 6 504.222
2 7 632.721
2 8 693.565
2 9 666.21
2 10 574.587
2 11 445.557
2 12 316.625
...

The fields in this file are separated by spaces and the records are separated by newline characters (making them the lines of a simple text file).

The size and the support are always written as integer numbers, the (average) occurrence count as an integer or a floating point number. Note that a pattern spectrum file that has been created outside of the script ccn+psf+psr.py and that states the size and/or the support as floating point numbers may cause problems with the pattern spectrum viewer. The script ccn+psf+psr.py writes the triplets of (size, support, count) sorted by size, and within the same size sorted by support.
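
Reading such a file back into a Python dictionary is straightforward, as in the following sketch (the function name is illustrative):

def read_spectrum(path):
    # Read a pattern spectrum file: (size, support) -> occurrence count.
    spectrum = {}
    with open(path) as psfile:
        for line in psfile:
            fields = line.split()
            if not fields or fields[0].startswith('#'):
                continue                  # skip empty lines and comments
            size, supp, count = fields
            spectrum[int(size), int(supp)] = float(count)
    return spectrum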

Pattern Output File

The pattern output file lists the found patterns (after pattern spectrum filtering and pattern set reduction), one pattern per record. The records of this file are separated by newline characters, making them the rows of a simple text file. Each record contains one found pattern, described as a list of items/neurons and a support indicator. The items/neurons are separated by spaces.
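
With the default output settings (item separator " ", support format " (%d)"), the record for a hypothetical pattern consisting of the items 2, 5 and 11 with a support of 8 would read:

2 5 11 (8)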

References

Copying

(MIT license, or more precisely Expat License; to be found in the file mit-license.txt in the directory pycoco/doc in the source package of the program, see also opensource.org and wikipedia.org)

© 2013-2016 Christian Borgelt

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Download

Download page with most recent version.

Contact

E-mail: christian@borgelt.net
Website: www.borgelt.net

© 2013-2016 Christian Borgelt