CoCoNAD+PSF+PSR

Continuous Time Closed Neuron Assembly Detection
with Pattern Spectrum Filtering and Pattern Set Reduction

Introduction

CoCoNAD (for Continuous Time Closed Neuron Assembly Detection) is an algorithm for finding frequent parallel episodes or frequent (approximately) synchronous events in parallel point processes, for example, in parallel spike trains as they are recorded with multi-electrode devices in neurobiology. The script/program ccn+psf+psr.py, which is written in Python, provides a command line interface to CoCoNAD and accesses a C-based implementation of the CoCoNAD algorithm through a Python extension module. It also comprises the methods of Pattern Spectrum Filtering (PSF) and Pattern Set Reduction (PSR), with which the abundance of frequent patterns that are usually found in a given data set can be reduced to a smaller set of significant patterns.

If a graphical user interface to the CoCoNAD method as well as to pattern spectrum filtering (PSF) and pattern set reduction (PSR) is desired instead of a command line interface (as ccn+psf+psr.py provides), the Java program CoCoGUI, which is available on its own download page, is worth looking at. This user interface accesses the same C implementation of the CoCoNAD algorithm through a library that employs the Java Native Interface (JNI) to make C-based functions available in Java. It also comprises the methods of pattern spectrum filtering (PSF) and pattern set reduction (PSR) as well as viewers for parallel (spike) train data, pattern spectra and pattern sets.

General information about the theory underlying the CoCoNAD algorithm can be found in [Picado-Muiño and Borgelt 2013], while implementation aspects are discussed in [Borgelt and Picado-Muiño 2013]. The method of pattern spectrum filtering was introduced in [Picado-Muiño et al. 2013] (although for time-binned data, while CoCoNAD works on continuous data, which requires some adaptations) and was discussed in more detail as well as extended by methods for pattern set reduction in [Torre et al. 2013] (again for time-binned data, but the core ideas can be transferred to the continuous time operation of CoCoNAD). It is recommended to study these papers before applying the script/program ccn+psf+psr.py to your own data. The following is only a very brief summary/review of some fundamental principles.

The core idea of CoCoNAD is to define a set of (approximately) synchronous events or a parallel episode in parallel (spike) trains or point processes as a group of events (action potentials or spikes in the neurobiology application) that occur with a user-specified maximum distance (in time) from each other. CoCoNAD then looks for frequent synchronous events or frequent parallel episodes, where the frequency or the support of a group of items (event types, or neurons in the neurobiological application) is measured by a maximum independent set approach: each set of items/events (spikes) of the same types (or: each parallel episode for the same items/event types, each set of approximately synchronous spikes from the same set of neurons) is an element of a family of sets. The size of a maximum independent set of this family, that is, of a selection of sets that have only empty pairwise intersections (independent set) and which comprises as many sets as possible (maximum independent set), is the support of the set of items/neurons. From a neurobiological point of view, this is a very plausible and intuitive support measure, since it ensures that no single event (spike) is counted more than once for the support of a given set of items/neurons. From a computer science/data analysis point of view, this support measure has the advantage that it is anti-monotone (or downward closed) and thus allows for effective and efficient pruning of the search for frequent synchronous events/parallel episodes.
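
To make the support measure concrete, the following is a minimal Python sketch of a greedy scheme that counts disjoint synchronous groups for a given set of items. The data layout (a dictionary mapping each item to its sorted list of event times) and all names are illustrative; the actual C implementation is considerably more refined:

def support(trains, items, w):
    # Greedily count disjoint groups of events, one per item, that lie
    # no farther apart than w (illustrative sketch of the MIS support).
    idx = {i: 0 for i in items}                  # one cursor per train
    count = 0
    while all(idx[i] < len(trains[i]) for i in items):
        times = {i: trains[i][idx[i]] for i in items}
        if max(times.values()) - min(times.values()) <= w:
            for i in items:                      # group found:
                idx[i] += 1                      # consume its events
            count += 1
        else:                                    # earliest event cannot be
            first = min(items, key=times.get)    # part of any group anymore,
            idx[first] += 1                      # so skip it
    return count

Because each event is consumed at most once, no event contributes to more than one counted occurrence, which is exactly the maximum independent set idea described above.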

The core idea underlying pattern spectrum filtering (PSF) is that among the abundance of frequent patterns found with CoCoNAD most are chance occurrences. That is, the majority of these patterns are such that they are likely to be observed even if there is no regularity in the (co-)occurrence of the events (that is, if the spike trains are independent or if events of different types occur independently). To filter out these chance patterns and thus reduce the patterns to the (likely) significant ones, surrogate data sets are generated by randomizing and permuting the original data, with the aim of destroying any (regular) synchronous activity, but keeping as many other properties of the data as possible (like the number of points per train/spikes per neuron, local event/spike densities etc). These surrogate data sets implicitly represent the null hypothesis of independent point processes/(spike) trains. By generating and analyzing a sufficiently large number of surrogate data sets, a pattern spectrum is generated. A pattern spectrum maps pattern signatures, that is, pairs of the size of a pattern (its number of items/neurons) and its support (number of (co-)occurrences in the maximum independent set sense), to occurrence counts, that is, to the (average) number of patterns with such a signature that have been observed in the surrogate data sets. It is then assumed that any patterns found in the original data that possess signatures that occur also in at least one surrogate data set can safely be considered chance events (since a counterpart — same signature — occurred in presumably independent data) and thus can be removed from the patterns found in the original data.
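
In outline, generating a pattern spectrum from surrogates amounts to the following Python sketch, where surrogate() and mine() are stand-ins for an actual surrogate generator and an actual CoCoNAD mining call (all names are illustrative):

from collections import Counter

def pattern_spectrum(trains, surrogate, mine, n=1000):
    # Average the signature counts over n surrogate data sets.
    spec = Counter()
    for _ in range(n):
        for items, supp in mine(surrogate(trains)):
            spec[len(items), supp] += 1          # count this signature
    return {sig: cnt / n for sig, cnt in spec.items()}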

As a faster alternative to generating a pattern spectrum by analyzing a large number of surrogate data sets, an estimation method is provided, which evaluates the characteristics of the data to produce an approximation of a pattern spectrum. Although this method yields only an approximation of an actual pattern spectrum, it has the advantage that it produces a pattern spectrum orders of magnitude faster and thus is the method of choice for larger data sets (more event types/neurons, longer recording periods) and for larger maximally allowed distances between events that are to be considered synchronous. Due to this speed advantage, estimating a pattern spectrum has been made the default operation mode.

The patterns that remain after pattern spectrum filtering are then further reduced by analyzing subset and superset relationships between them, since chance coincidences of subsets of found patterns and chance coincidences of the whole set with items/neurons that are not part of an actual pattern can lead to smaller patterns with higher support and larger patterns with lower support, respectively, which are not significant themselves, but only induced by an actually significant pattern. The core principle of the pattern set reduction (PSR) step is to define a preference relation between patterns of which one is a subset of the other and then to keep only patterns to which no other pattern is preferred.

Program Invocation

The Python script/program ccn+psf+psr.py is distributed together with several other (supporting) scripts in the archives psf+psr.zip or psf+psr.tar.gz (same contents, choose according to your preference) that are available on the download page for the PyCoCo library. One of these archives needs to be unpacked to some directory to get access to the main script/program ccn+psf+psr.py and its supporting scripts.

It is highly recommended to install the PyCoCo library as well (a Python extension module that makes a fast C-based implementation of the CoCoNAD algorithm as well as an efficient parallelized version of the pattern spectrum generation available in Python). The Python scripts also work without this library, but at a high price in execution time: the pure Python fallback can be slower by a factor of 40 or even more in certain cases.

Apart from the main script/program ccn+psf+psr.py, the archives contain several supporting scripts, some of which are always needed, while others are only fallen back on if the PyCoCo library is not installed.

The general syntax of the program invocation is

[python] ccn+psf+psr.py [options] infile [outfile]

Here [python] stands for the name of the Python interpreter that is to execute the script/program ccn+psf+psr.py. The script/program should work with both Python 2.7.x and Python 3.x. (Note, however, that the PyCoCo library needs to be compiled specifically for the Python version to be used. On the download page for the PyCoCo library, precompiled versions are available only for Python 2.7.x.)

On a GNU/Linux system, the script/program ccn+psf+psr.py may also be called directly (provided the executable flag of the file is set, otherwise execute the command "chmod +x ccn+psf+psr.py"), because the first line of this script/program indicates that /usr/bin/python should be used as the interpreter. This is the reason why the Python interpreter is indicated as optional (by the enclosing brackets). In the following, example command lines will be given in this format, that is, without an explicit statement of the Python interpreter. However, if the script is called directly, it may have to be specified as ./ccn+psf+psr.py, unless the current directory is on your PATH environment variable. (Note that this is usually not the case by default, due to computer security concerns.)

On Microsoft Windows, however, it is mandatory to specify this interpreter, since on Microsoft Windows there is no way to specify the Python interpreter in the script itself (I may be mistaken about this; if you know how to do it, please let me know). Whether the interpreter can be stated merely as python or whether a full path to the Python executable has to be given depends on whether the directory in which the Python interpreter resides is on the PATH environment variable or not. As this is system specific, no general statement can be made.

The first argument infile, which is mandatory, is the name of a file that contains the point processes/spike trains to analyze. The format of this file is described in the section (Spike) Trains File below.

The second argument outfile, which is optional (as indicated by the brackets), is the name of a file to which the found patterns (after pattern spectrum filtering and pattern set reduction) are to be written. Omitting it may be useful for benchmark tests, although after pattern spectrum filtering and pattern set reduction usually only a few patterns remain, so that the time it takes to write the output to a file is likely too small to conceal the actual search time. Omitting it may also be useful if only a pattern spectrum (that is, a mapping from pattern signatures to occurrence counts; option -P#) is to be determined. The format in which found patterns are written to the output file is described in the section Pattern Output File below.

In addition to the input and output file(s) several options can be given, all of which consist of a minus sign and a single letter. The full list of options can be found in the next section.

Some options take a parameter. For example, the option -s#, with which the minimum support is specified, takes a number as a parameter, which must follow the letter s without any separating space. The option -P#, which indicates that a pattern spectrum file is to be written, requires the name of the file as an argument. As with the option -s#, this file name must follow the option character P directly, that is, without a separating space. The format of the pattern spectrum file is described in the section Pattern Spectrum File below.

Options may be specified anywhere on the command line, that is, before the input file name, in between the input and output file names, or after the output file name.

If an option is given more than once, the last statement counts. That is, with

ccn+psf+psr.py -s2 -s3 input output

the minimum support is 3, as the -s2 is overwritten by the following -s3.

To test the operation of the script/program ccn+psf+psr.py, it is recommended to retrieve the example file trains.txt from the download page. If this file is placed into the same directory from which the script/program ccn+psf+psr.py was started, it can be used directly as the (spike) trains input data file. For example,

ccn+psf+psr.py trains.txt -Pspectrum.txt patterns.txt

processes this file with the default options. This should take between 20 and 60 seconds on a reasonably recent computer and should finally write three patterns to the output file patterns.txt. At the same time, the pattern spectrum that has been determined from the surrogate data sets is written to the file spectrum.txt.

Program Options

The script ccn+psf+psr.py supports the following options (a # after the option letter means that the option takes a parameter; option letters are case sensitive; the default of each option is given in parentheses):

-T#  target pattern type (default: i)
       i: item sets (item order is ignored)
       c: partial permutations (without repetition)
       m: sequences (with & without repetition)
-t#  target pattern subtype for original data (default: c)
       f/a/s: all frequent patterns
       c: closed (frequent) patterns
       m: maximal (frequent) patterns
-u#  target pattern subtype for surrogate data (default: f)
       f/a/s: all frequent patterns
       c: closed (frequent) patterns
       m: maximal (frequent) patterns
-w#  width of time window/maximum distance (default: 0.003)
-s#  minimum support of an item set (default: 2)
-m#  minimum number of items per item set (default: 1)
-n#  maximum number of items per item set (default: no limit)
-a#  start of point/time range (default: auto)
-z#  end of point/time range (default: auto)
-x   do not prune with perfect extensions (default: prune)
-g#  surrogate generation method (default: e)
       x: none (read pattern spectrum from file)
       i: identity (keep original data)
       r: point/spike time randomization
       d: point/spike dithering/displacement
       s: train shifting/dithering
       k: sampling from a kernel estimate
       p: dithered item permutation
       b: dithered blocked item permutation
       e: estimate pattern spectrum (no surrogates)
-c#  number of surrogate data sets (default: 1,000)
       (if ≤ 0, the pattern spectrum is read from a file)
-d#  random function type for displacements (default: r)
       u/r: uniform/rectangular density
       t: symmetric triangular density
       g/n: Gaussian/normal density
-p#  parameter for random function density (default: 0.005)
       (standard deviation or half base width of rectangle/triangle)
-B#  block size for blocked permutations (default: 0.03)
       (only used with option -gb)
-Z#  number of cpus/processor cores to use (default: 0)
       (a value ≤ 0 means all cpus reported as available)
-e#  item probability dispersion factor (default: 0.5)
       (only used with option -ge)
-i#  number of samples to be drawn per item set size (default: 1,000)
       (only used with option -ge)
-R#  pattern set reduction method (default: L)
       x: none (keep all patterns after filtering)
       c: excess coincidences (zb, cb-ca)
       C: excess coincidences (zb, cb-ca+1)
       i: excess items/neurons (za-zb+2, ca)
       s: covered points/spikes za*ca : zb*cb
       S: covered points/spikes (za-1)*ca : (zb-1)*cb
       l: combined lenient (C+i+s, break only rejection tie)
       L: combined lenient (C+i+S, break only rejection tie)
       t: combined strict (C+i+s, always force decision)
       T: combined strict (C+i+S, always force decision)
-S#  seed for pseudo-random numbers (default: 0, i.e. current time)
-P#  name of pattern spectrum file (default: none)
-q#  column separator for pattern spectrum (default: " ")
-h#  record header for output (default: "")
-k#  item separator for output (default: " ")
-v#  output format for pattern support (default: " (%d)")
-l#  one train per record (item and points) (default: id/time pairs)
-y#  no item (with -l) or item after point (default: item first)
-r#  record separators (default: "\n")
-f#  field separators (default: " \t,")
-b#  blank characters (default: " \t\r")
-C#  comment characters (default: "#")

This list of options is also printed if the script/program ccn+psf+psr.py is called without arguments.

Note that in the arguments to the options -r, -f, -b and -C ASCII escape sequences (like "\n" or "\t") are recognized.

With the options -a# and -z# a (spike) time range may be specified, with which the input data is filtered. That is, only points/spikes with occurrence times in the specified time range are used in the analysis. The spike time range is also used to clamp the points/spikes in surrogate data sets that are generated to obtain a pattern spectrum. That is, no spikes/event times will be generated that lie outside of this time range. Technically, this is achieved by wrapping round spike/event times before the start of the range to its end and spike/event times after the end of the range to its start.
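
In Python terms, this wrap-around clamping can be sketched as follows (illustrative; a and z denote the start and the end of the time range):

def wrap(t, a, z):
    # Map a (possibly displaced) time back into the range [a, z);
    # Python's % handles negative offsets, so times before the start
    # of the range wrap to its end and vice versa.
    return a + (t - a) % (z - a)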

If the options -a# and -z# are not given, the spike time range is determined automatically from the data: the lower bound of the range is the earliest point/event/spike time rounded to the next lower integer (unless an integer itself), and the upper bound of the range is the latest point/event/spike time rounded to the next higher integer (unless an integer itself).

A description of the surrogate data generation methods (option -g#) can be found in the section Pattern Spectrum Filtering below. Note that if the option -gx is given, that is, if no surrogate data generation method is selected, it is mandatory to provide the option -P# as well. This option then specifies the file from which a pattern spectrum is to be read. The format of a pattern spectrum file is described in the section Pattern Spectrum File below.

Pattern Spectrum Filtering

The PSF and PSR parameters start with the number of surrogate data sets to generate. According to [Picado-Muiño et al. 2013] this number should be computed as an estimate of the number of pattern signatures that occur in the original data and that one is actually willing to consider as significant, divided by the desired significance level for the patterns (thus implementing a simple Bonferroni correction for multiple testing). For example, if there are 30 pattern signatures that one is actually willing to consider as possible patterns and the desired significance level is 1% = 0.01, then 30/0.01 = 3,000 surrogate data sets should be generated. However, for most practical purposes, 1,000 data sets are usually sufficient, since the decision border does not change much with the number of surrogate data sets, provided that number is not too small.

For generating surrogate data sets, six different methods are available, plus two special options for reading a pattern spectrum from a file ("none (read pattern spectrum from file)", option -gx, where the file name has to be specified with the option -P#) and for keeping the original data unchanged ("identity", option -gi). Since generating multiple surrogate data sets with an identity mapping is obviously useless (the exact same set of patterns would be generated every time), the number of surrogate data sets is ignored for this option, and only one surrogate data set is generated (which, due to the identity mapping, is identical to the original data). This option mainly serves to allow a user to generate and store the pattern spectrum of the original data.

Most of the actual surrogate data generation methods require a probability density function, from which random numbers are sampled, and a parameter that governs the width of this density function ("sigma parameter", option -p#). The density function can be "rectangular" (or uniform; option -dr or -du), "triangular" (option -dt) or "Gaussian/normal" (option -dg or -dn). For the first two choices the sigma parameter states half the base width of the rectangular or triangular density function, in the same (time) units that are used in the (spike) train data file and in which the time window width is specified with the option -w# as well. For a Gaussian/normal density function, the sigma parameter states the standard deviation (which explains the name "sigma parameter", since in statistics the standard deviation is usually denoted by σ — the Greek character sigma).
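
In Python terms, drawing a single displacement according to the chosen density could be sketched like this (illustrative; sigma is the value given with the option -p#):

import random

def displacement(kind, sigma):
    # Draw one random offset according to the chosen density function.
    if kind in ('u', 'r'):                        # uniform/rectangular
        return random.uniform(-sigma, sigma)
    if kind == 't':                               # symmetric triangular
        return random.triangular(-sigma, sigma, 0.0)
    if kind in ('g', 'n'):                        # Gaussian/normal
        return random.gauss(0.0, sigma)
    raise ValueError("unknown density type " + kind)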

The surrogate data generation methods are:
  - point/spike time randomization (option -gr)
  - point/spike dithering/displacement (option -gd)
  - train shifting/dithering (option -gs)
  - sampling from a kernel estimate (option -gk)
  - dithered item permutation (option -gp)
  - dithered blocked item permutation (option -gb)

Note that the density function type and the sigma parameter are ignored for the first surrogate data generation method (point/spike time randomization), because this method always samples from a uniform distribution on the (time) range of the points/spikes.

Note also that the last three methods can, in principle, be applied with a sigma parameter of zero (that is, the chosen points/spikes are not displaced/dithered), while the second and third method (dithering and train shifting) require a non-zero sigma parameter to actually modify the data. The reason is that the last three methods not only displace/dither points/spikes within a train, but also permute points/spikes between trains. This has the advantage that possibly present synchronous activity is destroyed more thoroughly. However, the fourth and the fifth method have the disadvantage that the rate profiles of individual trains may also be changed (only the overall rate profile — the sum over all trains — is preserved). Only the last method (dithered blocked item permutation) preserves the rate profiles of individual trains (up to the precision of the block size).
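
For instance, a point/spike dithering surrogate (option -gd) with a Gaussian density could be sketched like this (illustrative only; trains is a dictionary of sorted spike time lists, and a and z denote the time range):

import random

def dither_surrogate(trains, a, z, sigma):
    # Displace every spike independently by a Gaussian offset and wrap
    # displaced times back into the recording range [a, z).
    return {item: sorted(a + (t + random.gauss(0.0, sigma) - a) % (z - a)
                         for t in times)
            for item, times in trains.items()}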

The execution of all surrogate generation methods is parallelized internally. That is, as many surrogate data sets are generated and analyzed in parallel as the computer on which the script/program ccn+psf+psr.py is running reports as available processor cores. This makes the pattern spectrum generation much faster, but can render the computer unresponsive and slow as long as the analysis is running, since all available processing power of the machine is utilized. To avoid this, the number of cpus/processor cores to be used can be set explicitly with the option -Z#. If a value ≤ 0 is given with this option, all cpus/processor cores are used that are reported as available by the system.
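
A minimal sketch of this scheme with Python's standard library (the function analyze is a stand-in for generating and mining one surrogate data set):

from multiprocessing import Pool

def analyze(seed):
    # Stand-in: generate one surrogate data set from `seed`, mine it,
    # and return the pattern signatures found in it.
    return []

if __name__ == '__main__':
    with Pool() as pool:                           # Pool(n) limits the cores
        results = pool.map(analyze, range(1000))   # used, like option -Z#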

In order to make the surrogate data generation reproducible, a seed for the pseudo-random number generator can be specified. An empty seed or a seed value of zero means that the current time (as a Unix time stamp) is used as a seed (which renders the surrogate data generation essentially irreproducible). Note, however, that reproducible results can only be obtained on computers that offer the same number of processor cores. A different number of processor cores usually means that a different set of surrogate data sets is generated, due to the different split into parallel execution threads.

As an alternative to generating and analyzing surrogate data sets, a pattern spectrum may be obtained by estimating it from the characteristics of the (spike) trains:

The pattern spectrum estimation assesses the expected counts for the pattern signatures for each pattern size separately. For a given pattern size, the number of "slots" in the data is counted, that is, the number of event groups of the given size that fall within the specified window width/are no farther apart than this maximum distance, and that could therefore hold a pattern of this size. Then a distribution over the coincidence/support values is estimated as an average of Poisson distributions, where the distribution parameter is computed from the number of slots and estimates of the probabilities of specific item/event type sets. These probabilities are computed from the occurrence probabilities of the individual items/event types.

However, it turns out that using the item/event type probabilities as they can be derived from the data leads to an overestimate of the expected number of coincidences. As a heuristic correction, the dispersion of the item probabilities is multiplied by a factor (less than 1) in order to make the probabilities more similar than they actually are. Empirically it was found that a factor around 0.4 to 0.5 leads to good results. This factor can be specified with the option -e#.

Furthermore, since different items/events occur with different probabilities and it is (due to a combinatorial explosion) impossible to enumerate all possibilities, samples of concrete item/event type sets are drawn. How many samples are to be drawn per item set size (and thus how many Poisson distributions are to be averaged for an estimate of the distribution over the coincidence/support values) can be specified with the option -i#. Since the sample drawing is also a random process, which one may want to make reproducible, a seed for the random number generator may be specified (see the explanations given above for the surrogate data generation). Note that the number of surrogates is also used for the pattern spectrum estimation, namely as an equivalent number that determines how the occurrence counts for the signatures are rounded and thresholded.
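
The following is only a rough illustration of this averaging idea, not the actual estimation procedure; slots and item_probs stand for quantities derived from the data as described above, and all names are illustrative:

import math, random

def estimate_distribution(slots, item_probs, size, nsamples=1000, cmax=20):
    # Average Poisson distributions over sampled item sets of the given
    # size to estimate the distribution over coincidence/support values.
    items = list(item_probs)
    dist = [0.0] * (cmax + 1)
    for _ in range(nsamples):
        chosen = random.sample(items, size)
        lam = slots * math.prod(item_probs[i] for i in chosen)
        for c in range(cmax + 1):            # add this set's Poisson term
            dist[c] += math.exp(-lam) * lam ** c / math.factorial(c)
    return [d / nsamples for d in dist]      # average over the samples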

After a pattern spectrum has been obtained by analyzing surrogate data sets, it may be written to the pattern spectrum file that is specified with the option -P# (if this option has been given), and the patterns found in the original data are filtered with it. That is, only patterns with signatures that do not occur in the surrogate data sets/in the (estimated) pattern spectrum are kept, where a pattern signature is a pair of pattern size (number of items/neurons) and pattern support (occurrence frequency/number of coincidences in a maximum independent set sense). All other patterns are discarded.
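
In code, this filtering step amounts to no more than a signature lookup (a sketch; patterns are assumed to be (item set, support) pairs, and the spectrum a dictionary mapping (size, support) signatures to counts):

def pattern_spectrum_filter(patterns, spectrum):
    # Keep only patterns whose signature never occurred in the surrogates.
    return [(items, supp) for items, supp in patterns
            if (len(items), supp) not in spectrum]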

Pattern Set Reduction

The patterns remaining after pattern spectrum filtering (PSF) are then further reduced by pattern set reduction (PSR), which analyzes subset and superset relationships between the found patterns. The reason for this step is that due to chance coincidences of subsets of items/neurons of an actually significant pattern or a few chance coincidences with additional items/neurons outside of the actually significant patterns, the filtered pattern set may still contain non-significant induced patterns. These patterns are removed based on the following rationale: between pairs of patterns, one of which is a subset of the other, a (possibly incomplete) preference relation is defined. Based on this preference relation, only those patterns are kept to which no other pattern (subset or superset pattern) is preferred.

In the following description of the pattern set reduction methods (or rather the preference relations underlying them), A and B refer to two patterns (sets of items/neurons) with B ⊆ A (that is, B is a subset — or subpattern — of A), za and zb refer to the sizes of A and B (that is, to the number of items/neurons contained in them), and ca and cb refer to their supports (that is, to the number of coincidences/synchronous events/parallel episodes in a maximum independent set sense exhibited by each pattern).

The pattern set reduction methods/preference relations are:
  - none (option -Rx): keep all patterns left after pattern spectrum filtering.
  - excess coincidences (options -Rc and -RC): the subset B is preferred if its excess coincidences over the superset A are significant, that is, if the signature (zb, cb-ca) (for -Rc) or (zb, cb-ca+1) (for -RC) does not occur in the pattern spectrum; otherwise the superset A is preferred.
  - excess items/neurons (option -Ri): the superset A is preferred if its excess items over the subset B are significant, that is, if the signature (za-zb+2, ca) does not occur in the pattern spectrum; otherwise the subset B is preferred.
  - covered points/spikes (options -Rs and -RS): the pattern that covers more points/spikes is preferred, that is, A is preferred if za*ca > zb*cb (for -Rs) or (za-1)*ca > (zb-1)*cb (for -RS), and B otherwise.
  - combined lenient (options -Rl and -RL) and combined strict (options -Rt and -RT): combine the excess coincidences, excess items and covered points/spikes criteria (C+i+s for -Rl and -Rt, C+i+S for -RL and -RT); the lenient variants use the combination only to break a rejection tie, while the strict variants always force a decision.
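
The final selection can then be pictured as the following Python sketch, where preferred(p, q) stands for whichever of the above preference relations was chosen, and patterns are given as (item set, support) pairs with item sets as frozensets (all names are illustrative):

def pattern_set_reduction(patterns, preferred):
    # Keep only patterns to which no sub- or superset pattern is preferred.
    keep = []
    for a, ca in patterns:
        rivals = [(b, cb) for b, cb in patterns
                  if b != a and (b < a or a < b)]    # proper sub/supersets
        if not any(preferred(r, (a, ca)) for r in rivals):
            keep.append((a, ca))
    return keep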

After the patterns have been reduced with pattern set reduction (PSR), the final set of patterns is written to the pattern output file that is specified as the (optional) second argument on the command line.

Data Format

The following sections describe the data format of the three file types that the script ccn+psf+psr.py reads or writes: the (spike) trains file, the pattern spectrum file and the pattern output file.

(Spike) Trains File

The (spike) trains file has to be a (text) file structured by field and record separators and blanks. Record separators, not surprisingly, separate records, usually lines (since the default record separator is the newline character). Field separators separate fields, usually words (since among the default field separators are the space and the tabulator character, but also the comma). Blanks are used to fill fields, for example, to align them. In addition, comment characters are recognized: if a record starts with a character that is among those declared as comment characters, the record is considered to be a comment and is therefore ignored. Record separators are specified with the option -r, field separators with -f, blank characters with -b and comment characters with -C. In the strings passed as arguments to these options standard ASCII escape sequences, like "\n" or "\t", are recognized. Furthermore, special characters can be specified as "\000" or "\x00", where the zeros have to be replaced with the octal or hexadecimal code of the character, respectively.

There are four different record formats, which can be selected with the two options -l and -y (an example input in the default format is shown below):
  - default (neither -l nor -y): each record contains an item (event type/neuron identifier) followed by a point (event/spike time), that is, one event per record;
  - with -y: each record contains a point followed by an item, that is, the item comes after the point;
  - with -l: each record contains one whole train, that is, an item followed by all points of that train;
  - with -l and -y: each record contains only the points of one train, without an item; the trains are then distinguished simply by their position in the file.
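
For example, in the default format an input file might begin like this (hypothetical data; the first field is the item/neuron identifier, the second the event/spike time):

1 0.0273
2 0.0281
1 0.0352
3 0.0360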

Note that in all of the above formats, commas or tabulator characters may also be used as the field separator without having to change any settings, since the comma and the tabulator character are also among the default field separators.

Pattern Spectrum File

A pattern spectrum maps pattern signatures, that is, pairs of the size of a pattern (number of items/neurons) and its support (number of (co-)occurrences in the maximum independent set sense), to occurrence counts, that is, to the (average) number of patterns with such a signature that have been observed in the surrogate data sets.

The name of the pattern spectrum file can be specified with the option -P (by default, no pattern spectrum file is written). If a pattern spectrum file is written, it contains three columns: the first column contains the pattern size, the second column the pattern support and the third column the (average) occurrence count. An example pattern spectrum file starts like this:
2 2 27.35
2 3 82.748
2 4 188.141
2 5 338.811
2 6 504.222
2 7 632.721
2 8 693.565
2 9 666.21
2 10 574.587
2 11 445.557
2 12 316.625
...

The fields in this file are separated by spaces and the records are separated by newline characters (making them the lines of a simple text file).

The size and the support are always written as integer numbers, the (average) occurrence count as an integer or a floating point number. Note that a pattern spectrum file that has been created outside of the script ccn+psf+psr.py and that states the size and/or the support as floating point numbers may cause problems with the pattern spectrum viewer. The script ccn+psf+psr.py writes the triplets of (size, support, count) sorted by size, and within the same size sorted by support.
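
Reading such a file back into a Python dictionary is straightforward, as in the following sketch (the function name is illustrative):

def read_spectrum(path):
    # Read a pattern spectrum file: (size, support) -> occurrence count.
    spectrum = {}
    with open(path) as psfile:
        for line in psfile:
            fields = line.split()
            if not fields or fields[0].startswith('#'):
                continue                  # skip empty lines and comments
            size, supp, count = fields
            spectrum[int(size), int(supp)] = float(count)
    return spectrum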

Pattern Output File

The pattern output file lists the found patterns (after pattern spectrum filtering and pattern set reduction), one pattern per record. The records of this file are separated by newline characters, making them the rows of a simple text file. Each record contains one found pattern, described as a list of items/neurons and a support indicator. The items/neurons are separated by spaces.
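
With the default output settings (item separator " ", support format " (%d)"), the record for a hypothetical pattern consisting of the items 2, 5 and 11 with a support of 8 would read:

2 5 11 (8)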

References

Copying

(MIT license, or more precisely Expat License; to be found in the file mit-license.txt in the directory pycoco/doc in the source package of the program, see also opensource.org and wikipedia.org)

© 2013-2016 Christian Borgelt

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Download

Download page with most recent version.

Contact

E-mail: christian@borgelt.net
Website: www.borgelt.net

© 2013-2016 Christian Borgelt