New Algorithms for Finding Approximate Frequent Item Sets

Christian Borgelt1, Christian Braune1,2, Tobias Kötter3 and Sonja Grün4,5

1 European Centre for Soft Computing, c/ Gonzalo Gutiérrez Quirós s/n, E-33600 Mieres (Asturias), Spain
2 Dept. of Computer Science, Otto-von-Guericke-University of Magdeburg, Universitätsplatz 2, D-39106 Magdeburg, Germany
3 Dept. of Computer Science, University of Konstanz, Box 712, D-78457 Konstanz, Germany
4 RIKEN Brain Science Institute, Wako-Shi, Saitama 351-0198, Japan
5 Institute of Neuroscience and Medicine (INM-6), Research Center Jülich, Germany

[email protected], [email protected], [email protected], [email protected]

Abstract. In standard frequent item set mining a transaction supports an item set only if all items in the set are present. However, in many cases this requirement is too strict and can render it impossible to find certain relevant groups of items. By relaxing the support definition, allowing for some items of a given set to be missing from a transaction, this drawback can be remedied. The resulting item sets have been called approximate, fault-tolerant or fuzzy item sets. In this paper we present two new algorithms to find such item sets: the first is an extension of item set mining based on cover similarities; it computes and evaluates the subset size occurrence distribution with a scheme that is related to the Eclat algorithm. The second employs a clustering-like approach, in which the distances are derived from the item covers with distance measures for sets or binary vectors and which is initialized with a one-dimensional Sammon projection of the distance matrix. We demonstrate the benefits of our algorithms by applying them to a concept detection task on the 2008/2009 Wikipedia Selection for Schools and to the neurobiological task of detecting neuron ensembles in (simulated) parallel spike trains.

1 Introduction and Motivation

In many applications of frequent item set mining one faces the problem that the transaction data to analyze is imperfect: items that are actually contained in a transaction are not recorded as such. The reasons can be manifold, ranging from noise through measurement errors to an underlying feature of the observed process. For instance, in gene expression analysis, where one may try to find coexpressed genes with frequent item set mining [13], binary transaction data is often obtained by thresholding originally continuous data, which are easily affected by noise in the experimental setup or limitations of the measurement devices.

Analyzing alarm sequences in telecommunication data for frequent episodes can be impaired by alarms being delayed or dropped due to the fault causing the alarm also affecting the transmission system [36]. In neurobiology, where one searches for ensembles of neurons in parallel spike trains with the help of frequent item set mining and related approaches [17, 19, 4], ensemble neurons are expected to participate in synchronous activity only with a certain probability.

In this paper we present two new algorithms to cope with such problems. The first algorithm relies on a standard item set enumeration scheme and is fairly closely related to item set mining based on cover similarities [32]. It efficiently computes the subset size occurrence distribution of item sets, evaluates this distribution to find fault-tolerant item sets, and uses intermediate data to remove pseudo (or spurious) item sets, which contain one or more items that are too weakly supported (that is, items that occur in too few of the supporting transactions) to warrant including them in the item set. The second algorithm employs a heuristic search scheme (rather than a full enumeration) and is inspired by the observation that in sparse data relevant item sets produce "lines" in a binary matrix representation of the transactional data if their items are properly reordered, or even "tiles" if the transactions are properly reordered as well. The interesting item sets are then found by linear traversals of the items and statistical tests. The main advantage of this method compared to item set enumeration schemes is that it considerably lowers the number of statistical tests that are needed to confirm the significance of found item sets. Thus it reduces the number of false positive results, while still being able to find the most interesting item sets (or at least part of them).

The rest of this paper is structured as follows: in Section 2 we review the task of approximate/fault-tolerant item set mining and categorize different approaches to this task. (However, we do not consider item-weighted or uncertain transaction data.) In Section 3 we describe how our first algorithm traverses the search space, how it efficiently computes the subset size occurrence distribution for each item set it visits, and how this distribution is evaluated. In Section 4 we discuss how the intermediate/auxiliary data that is available in our algorithm can be used to easily cull pseudo (or spurious) item sets. In Section 5 we evaluate our first algorithm on artificially generated data with injected approximate item sets in order to confirm its effectiveness. In addition, we compare its performance to two other algorithms that fall into the same category and for specific cases can be made to find the exact same item sets. In Section 6 we apply our first algorithm to a concept detection task on the 2008/2009 Wikipedia Selection for Schools to demonstrate its practical usefulness. In Section 7 we review basics for the second algorithm, namely methods to measure the (dis)similarity of item covers by means of similarity/distance measures for sets and binary vectors. In Section 8 we consider, as a first step, how these (dis)similarities can be used in a clustering algorithm to find interesting item sets. In Section 9 we improve on this approach by applying a non-linear mapping to reorder the items, so that they can be tested in a linear traversal. In Section 10 we describe an application of this method to find neuron ensembles in (simulated) parallel spike trains. Finally, in Section 11 we draw conclusions and point out possible future work.

[Figure 1: three binary-matrix diagrams (rows: transactions, columns: items), one each for a perfect/standard, a fault-tolerant, and a pseudo/spurious item set.]

Fig. 1. Different types of item sets illustrated as binary matrices.

2 Approximate or Fault-Tolerant Item Set Mining

In standard frequent item set mining only transactions that contain all of the items in a given set are counted as supporting this set. In contrast to this, in approximate (or fault-tolerant or fuzzy) item set mining transactions that contain only a subset of the items can still support an item set, though possibly to a lesser degree than transactions that contain all items. Based on the illustration of these situations shown in the diagrams on the left and in the middle of Figure 1, approximate item set mining has also been described as finding almost pure (geometric or combinatorial) tiles of ones in a binary matrix that indicates which items are contained in which transactions [18]. In order to cope with missing items in the transaction data to analyze, several approximate (or fault-tolerant or fuzzy) frequent item set mining approaches have been proposed. They can be categorized roughly into three classes: (1) error-based approaches, (2) density-based approaches, and (3) cost-based approaches.

Error-based Approaches. Examples of error-based approaches are [27] and [3]. In the former the standard support measure is replaced by a fault-tolerant support, which allows for a maximum number of missing items in the supporting transactions, thus ensuring that the measure is still anti-monotone. The search algorithm itself is derived from the famous Apriori algorithm [2]. In [3] constraints are placed on the number of missing items as well as on the number of (supporting) transactions that do not contain an item in the set. Hence it is related to the tile-finding approach in [18]. However, it uses an enumeration search scheme that traverses sub-lattices of items and transactions, thus ensuring a complete search, while [18] relies on a heuristic scheme.

Density-based Approaches. Rather than fixing a maximum number of missing items, density-based approaches allow a certain fraction of the items in a set to be missing from the transactions, thus requiring the corresponding binary matrix tile to have a minimum density. This means that for larger item sets more items are allowed to be missing than for smaller item sets. As a consequence, the measure is no longer anti-monotone if the density requirement is to be fulfilled by each individual transaction. To overcome this, [37] requires only that the average density over all supporting transactions must exceed a user-specified threshold, while [33] defines a recursive measure for the density of an item set.

Cost-based Approaches. In error- or density-based approaches all transactions that satisfy the constraints contribute equally to the support of an item

set, regardless of how many items of the set they contain. In contrast to this, cost-based approaches define the contribution of transactions in proportion to the number of missing items. In [36, 6] this is achieved by means of user-provided item-specific costs or penalties, with which missing items can be inserted. These costs are generally combined with each other and with the initial transaction weight of 1 with the help of a t-norm (triangular norm, see [23] for a comprehensive treatment). In addition, a minimum weight for a transaction can be specified, by which the number of insertions can be limited. Note that the cost-based approaches can be made to contain the error-based approaches as limiting or extreme cases, since one may set the cost (or penalty) of inserting an item into a transaction in such a way that the transaction weight is not reduced. In this case limiting the number of insertions obviously has the same effect as allowing for a maximum number of missing items.

The first approach presented in this paper falls into the category of cost-based approaches, since it reduces the support contribution of transactions that do not contain all items of a considered item set. How much the contribution is reduced and how many missing items are allowed can be controlled directly by a user. However, it treats all items the same, while the cost-based approaches reviewed above allow for item-specific penalties. Its advantages are that, depending on the data set, it can be faster, admits more sophisticated support/evaluation functions, and allows for a simple filtering of pseudo (or spurious) item sets.

In pseudo (or spurious) item sets a subset of the items is strongly correlated in many transactions—possibly even a perfect item set (all subset items are contained in all supporting transactions). In such a case the remaining items may not occur in any (or only in very few) of the (fault-tolerantly) supporting transactions, but despite the ensuing reduction of the weight of all transactions, the item set support can still exceed the user-specified threshold (see Figure 1 on the right for an illustration of a pseudo (or spurious) item set and note the regular pattern of missing items compared to the middle diagram). Obviously, such item sets are not useful and should be discarded, which is easy in our algorithm, but difficult in the cost-based approaches reviewed above.

The second approach we present in this paper is not easily classified into the above categories, because it does not rely on item set enumeration, but rather on a heuristic search scheme, which does not ensure a complete result. Since it decides by statistical tests which item sets are interesting/relevant, it does not use (explicit) transaction costs, nor does it require strictly limited errors or a minimum density. It is most closely related to the tile-finding approach of [18], but significantly improves the item ordering criterion and thus can apply a much simpler method to actually identify the significant item sets.

As a final comment we remark that a closely related setting is the case of uncertain transactional data, where each item is endowed with a transaction-specific weight or probability. This weight is meant to indicate the degree or chance with which the item is actually a member of the transaction. Approaches to this certainly related, but nevertheless fundamentally different problem, which we do not consider here, can be found, for example, in [12, 25, 1, 9].
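To make the cost-based weighting concrete, the following is a minimal Python sketch (not taken from [36, 6] or from our implementation) of how missing items reduce a transaction's support contribution when the product t-norm is used and a minimum transaction weight limits the number of insertions; the parameter names penalty and min_weight are chosen for this illustration only.

# Illustrative sketch (not the authors' implementation): cost-based support of an
# item set, where each item missing from a transaction is "inserted" with a
# user-given penalty and the penalties are combined with the product t-norm.

def cost_based_support(itemset, transactions, penalty=0.5, min_weight=0.25):
    """Sum of transaction weights; a transaction starts with weight 1 and is
    multiplied by `penalty` for every item of `itemset` it does not contain.
    Transactions whose weight drops below `min_weight` do not contribute,
    which implicitly limits the number of allowed insertions."""
    support = 0.0
    for t in transactions:
        missing = len(itemset - t)      # items that would have to be inserted
        weight = penalty ** missing     # product t-norm of 1 and the penalties
        if weight >= min_weight:
            support += weight
    return support

# Example: with penalty 0.5 and min_weight 0.25, at most two insertions are allowed.
trans = [frozenset(s) for s in ({"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"})]
print(cost_based_support(frozenset({"a", "b", "c"}), trans))   # 1 + 0.5 + 0.5 + 0.25 = 2.25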

3 Subset Size Occurrence Distribution

The basic idea of our first algorithm is to compute, for each visited item set, how many transactions contain subsets with 1, 2, . . . , k items, where k is the size of the considered item set. We call this the subset size occurrence distribution of the item set, as it states how often subsets of different sizes occur. Note that we do not distinguish different subsets with the same size, and thus treat all items the same: it is only relevant whether (and, if so, how many) items are missing, but not which specific items are missing. This distinguishes this algorithm from those presented in [36, 6], which allow item-specific insertion costs (or penalties). The subset size occurrence distribution is evaluated by a function that combines, in a weighted manner, the entries which refer to subsets of a user-specified minimum size, which is stated relative to the size of the item set itself and thus corresponds to a maximum number of missing items. Item sets that reach a user-specified minimum value for the evaluation measure are reported.

Computing the subset size occurrence distribution is surprisingly easy with the help of an intermediate array that records for each transaction how many of the items in the currently considered set are contained in it. In the search, which is a standard depth-first search in the subset lattice that can also be seen as a divide-and-conquer approach (see, for example, [6] for a formal description), this intermediate array is updated each time an item is added to or removed from the current item set. The counter update is most conveniently carried out with the help of transaction identifier lists. That is, our algorithm uses a vertical database representation and thus is closely related to the well-known Eclat algorithm [39]. The updated fields of the item counter array then give rise to updates of the subset size occurrence distribution, which records, for each subset size, how many transactions contain at least as many items of the current item set.

Pseudo-code of the (simplified) recursive search procedure is shown in Figure 2. Together with the recursion the main while-loop implements the depth-first/divide-and-conquer search by first including an item in the current set (first subproblem — handled by the recursive call) and then excluding it (second subproblem — handled by skipping the item in the while-loop). The for-loop at the beginning of the outer while-loop first increments the item counters for each transaction containing the current item n (note that items are coded by consecutive integers starting at 0), which thus is added to the current item set. Then the subset size occurrence distribution is updated by drawing on the new values of the updated item counters. Note that one could also remove a transaction from the counter for the original subset size (after adding the current item), so that the distribution array elements represent the number of transactions that contain exactly the number of items given by their indices. This could be achieved with an additional instruction dec(dist[cnts[t[i]]]) as the first statement of the for-loop. However, this operation is more costly than forming differences between neighboring elements in the evaluation function, which yields the same values (see Figure 4 — to be discussed later).

As an illustration, Figure 3 shows an example of the update.

global variables:                           (∗ may also be passed down in recursion ∗)
  lists : array of array of integer;        (∗ transaction identifier lists ∗)
  cnts  : array of integer;                 (∗ item counters, one per transaction ∗)
  dist  : array of integer;                 (∗ subset size occurrence distribution ∗)
  iset  : set of integer;                   (∗ current item set ∗)
  emin  : real;                             (∗ minimum evaluation of an item set ∗)

procedure sodim (n: integer);               (∗ n: number of selectable items ∗)
                                            (∗ (items are coded by integers 0 to n − 1) ∗)
var i : integer;                            (∗ loop variable ∗)
    t : array of integer;                   (∗ to access the transaction id lists ∗)
    e : real;                               (∗ item set evaluation result ∗)
begin
  while n > 0 do begin                      (∗ while there are items left ∗)
    n := n − 1; t := lists[n];              (∗ get the next item and its trans. ids ∗)
    for i := 0 upto length(t)-1 do begin    (∗ traverse the transaction ids ∗)
      inc(cnts[t[i]]);                      (∗ increment the item counter and ∗)
      inc(dist[cnts[t[i]]]);                (∗ the subset size occurrences, ∗)
    end;                                    (∗ i.e., update the distribution ∗)
    e := eval(dist, length(iset)+1);        (∗ evaluate subset size occurrence distrib. ∗)
    if e ≥ emin then begin                  (∗ if the current item set qualifies ∗)
      add(iset, n);                         (∗ add current item to the set ∗)
      ⟨report the current item set iset⟩;
      sodim(n);                             (∗ recursively check supersets ∗)
      remove(iset, n);                      (∗ remove current item from the set, ∗)
    end;                                    (∗ i.e., restore the original item set ∗)
    for i := 0 upto length(t)-1 do begin    (∗ traverse the transaction ids ∗)
      dec(dist[cnts[t[i]]]);                (∗ decrement the subset size occurrences ∗)
      dec(cnts[t[i]]);                      (∗ and then the item counter, ∗)
    end;                                    (∗ i.e., restore the original distribution ∗)
  end;
end;                                        (∗ end of sodim() ∗)

Fig. 2. Simplified pseudo-code of the recursive search procedure.

The top row shows the list of transaction identifiers for the current item n (held in the pseudo-code in the local variable t), which is traversed to select the item counters that have to be incremented. The second row shows these item counters, with old and unchanged counter values shown in black and updated values in blue. Using the new (blue) values as indices into the subset size distribution array, this distribution is updated. Again old and unchanged values are shown in black, new values in blue. Note that dist[0] always holds the total number of transactions.

An important property of this update operation is that it is reversible. By traversing the transaction identifiers again, the increments can be retracted, thus restoring the original subset size occurrence distribution (before the current item n was added). This is exploited in the for-loop at the end of the outer while-loop in Figure 2, which restores the distribution by first decrementing the subset size occurrence counter and then the item counter for the transaction (that is, the steps are reversed w.r.t. the update in the first for-loop).

[Figure 3: worked example of the update. The transaction identifier list lists[n] (here the transactions 1, 2, 4, 7, 8 and 11) selects the entries of the item counter array cnts (one counter per transaction) that are incremented; the new counter values then serve as indices into the subset size occurrence array dist, whose corresponding entries are incremented in turn.]

Fig. 3. Updating the subset size occurrence distribution with the help of an item counter array, which records the number of contained items per transaction.

global variables:                           (∗ may also be passed down in recursion ∗)
  wgts : array of real;                     (∗ weights per number of missing items ∗)

function eval (d: array of integer,         (∗ d: subset size occurrence distribution ∗)
               k: integer) : real;          (∗ k: number of items in the current set ∗)
var i: integer;                             (∗ loop variable ∗)
    e: real;                                (∗ evaluation result ∗)
begin
  e := d[k] · wgts[0];                      (∗ initialize the evaluation result ∗)
  for i := 1 upto min(k, length(wgts)) do   (∗ traverse the distribution ∗)
    e := e + (d[k − i] − d[k − i + 1]) · wgts[i];
  return e;                                 (∗ weighted sum of transaction counters ∗)
end;                                        (∗ end of eval() ∗)

Fig. 4. Pseudo-code of a simple evaluation function.

Between the for-loops the subset size occurrence distribution is evaluated, and if the evaluation result reaches a user-specified threshold, the extended item set is actually constructed and reported. Afterwards supersets of this item set are processed recursively and finally the current item is removed again. This is in perfect analogy to standard frequent item set algorithms like Eclat or FP-growth, which employ the same depth-first/divide-and-conquer scheme.

The advantage of our algorithm is that the evaluation function has access to fairly rich information about the occurrences of subsets of the current item set. While standard frequent item set mining algorithms only compute (and evaluate) dist[k] (which always contains the standard support) and the JIM algorithm [32] computes and evaluates only dist[k], dist[1] (number of transactions that contain at least one item in the set), and dist[0] (total number of transactions), our algorithm knows (or can easily compute as a simple difference) how many transactions miss 1, 2, 3 etc. items. Of course, this additional information comes at a price, namely a higher processing time, but in return one obtains the possibility to compute much more sophisticated item set evaluations. Note, however, that the asymptotic time complexity of the search (in terms of the O notation) is not worsened (that is, the higher processing time is basically a constant factor), since all frequent item set mining algorithms using a depth-first search or divide-and-conquer scheme have, in the worst case, a time complexity

that is exponential in the number of items (or linear in the number of found item sets), as well as linear in the number of transactions.

A very simple example of such an evaluation function is shown in Figure 4: it weights the numbers of transactions in proportion to the number of missing items. The weights can be specified by a user and are stored in a global weights array. We assume that wgts[0] = 1 and wgts[i] ≥ wgts[i + 1]. With this function fault-tolerant item sets can be found in a cost-based manner, where the costs are represented by the weights array. Note, however, that it can also be used to find item sets with an error-based scheme by setting wgts[0] = . . . = wgts[r] = 1 for a user-chosen maximum number r of missing items and wgts[i] = 0 for all i > r. With such a "crisp" weighting scheme all transactions contribute equally, provided they lack no more than r items of the considered set.

An obvious alternative to the simple weighting function of Figure 4 is to divide the final value of e by dist[1] in order to obtain an extended Jaccard measure—an approach that is inspired by the JIM algorithm [32]. In principle, all measures listed in [32] can be generalized in this way, by simply replacing the standard support (all items are contained) with the extended support computed in the function shown in Figure 4, thus providing a variety of measures. Note that the extended support computed by the above function, as well as the extended Jaccard measure that can be derived from it, are obviously anti-monotone, since each element of the subset size occurrence distribution is anti-monotone (if elements are paired from the number of items in the respective sets downwards, as demonstrated in Figure 4), while dist[1] is clearly monotone. This ensures the correctness of our algorithm in the sense that it is guaranteed to find all item sets satisfying the minimum evaluation criterion.
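For readers who prefer working code over pseudo-code, the following is a compact Python sketch of the scheme of Figures 2 and 4 (a simplification for illustration only; the actual SODIM implementation, available at http://www.borgelt.net/sodim.html, is considerably more elaborate and adds the filtering described in the next section).

# Minimal Python sketch of the search of Figures 2 and 4 (simplified).

def sodim(n, lists, cnts, dist, wgts, emin, iset, out):
    """lists[i]: transaction ids containing item i; cnts[t]: number of items of the
    current set in transaction t; dist[s]: number of transactions containing at
    least s items of the current set (dist[0] = total number of transactions)."""
    while n > 0:
        n -= 1
        t = lists[n]
        for tid in t:                      # add item n: update the counters and
            cnts[tid] += 1                 # the subset size occurrence distribution
            dist[cnts[tid]] += 1
        e = evaluate(dist, len(iset) + 1, wgts)
        if e >= emin:                      # item set qualifies: report and recurse
            iset.append(n)
            out.append((sorted(iset), e))
            sodim(n, lists, cnts, dist, wgts, emin, iset, out)
            iset.pop()
        for tid in t:                      # remove item n: restore the distribution
            dist[cnts[tid]] -= 1           # (exact reversal of the update above)
            cnts[tid] -= 1

def evaluate(dist, k, wgts):
    """Weighted extended support: wgts[i] is the weight of transactions that miss
    exactly i items of the current set (wgts[0] = 1, weights non-increasing)."""
    e = dist[k] * wgts[0]
    for i in range(1, min(k, len(wgts) - 1) + 1):
        e += (dist[k - i] - dist[k - i + 1]) * wgts[i]
    return e

def mine(transactions, num_items, wgts, emin):
    lists = [[t for t, items in enumerate(transactions) if i in items]
             for i in range(num_items)]
    cnts = [0] * len(transactions)
    dist = [0] * (num_items + 1)
    dist[0] = len(transactions)            # dist[0] always holds the total count
    out = []
    sodim(num_items, lists, cnts, dist, wgts, emin, [], out)
    return out

# Example: three items, wgts = [1.0, 0.5] (a transaction with one missing item
# contributes 0.5), minimum extended support 2.0.
ts = [{0, 1, 2}, {0, 1}, {0, 2}, {1, 2}, {2}]
print(mine(ts, 3, [1.0, 0.5], 2.0))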

4 Removing Pseudo/Spurious Item Sets

As already mentioned above, pseudo (or spurious) item sets can result if there exists a subset of items that is strongly correlated and supported by many transactions. In such a case adding an item to this set may not reduce the support enough to let it fall below the user-specified threshold, even if this item is not contained in any of the transactions containing the correlated items. As an illustration consider the right diagram in Figure 1: the third item is contained in only one of the eight transactions. However, the total number of missing items in this binary matrix (and thus the extended support) is the same as in the middle diagram, which we would consider as a representation of an acceptable fault-tolerant item set, since each item occurs in a sufficiently large fraction of the supporting transactions. (Of course, what counts as “sufficient” is a matter of choice and thus must be specified by a user.) In order to cull such pseudo (or spurious) item sets from the output, we added to our algorithm a check whether all items of the set occur in a sufficiently large fraction of the supporting transactions. This check can be carried out in two forms: either the user specifies a minimum fraction of the support of an item set that must be produced from transactions containing the item (in this case

the reduced weights of transactions with missing items are considered) or he/she specifies a minimum fraction of the number of supporting transactions that must contain the item (in this case all transactions have the same unit weight). Both checks can fairly easily be carried out with the help of the vertical transaction representation (transaction identifier lists), the intermediate/auxiliary item counter array (with one counter per transaction) and the subset size occurrence distribution: one simply traverses the transaction identifier list for each item in the item set to check and computes the number of supporting transactions that contain the tested item (or the support contribution derived from these transactions). The result is then compared with the total number of supporting transactions (which is available in dist[m], where m is the number of weights—see Figure 4) or the extended support (the result of the evaluation function shown in Figure 4). If the result exceeds a user-specified threshold (given as a fraction or a percentage) for all items in the set, the item set is accepted, otherwise it is discarded (from the output; but the set is still processed recursively, because these conditions are not anti-monotone and thus cannot be used for pruning).

In addition, it is often beneficial to filter the output for closed item sets (no superset has the same support/evaluation) or maximal item sets (no superset has a support/evaluation exceeding the user-specified threshold). In principle, this can be achieved with the same methods that are used in standard frequent item set mining. In our algorithm we consider closedness or maximality only w.r.t. the standard support (all items contained), but in principle, it could also be implemented w.r.t. the more sophisticated measures. Note, however, that this notion of closedness differs from the notion introduced and used in [7, 28], which is based on δ-free item sets and is a mathematically more sophisticated approach. In principle, though, a check whether a found item set is closed w.r.t. this notion could be added to our algorithm, but we did not follow this path yet.
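As an illustration of this check, the following is a simplified, unweighted sketch (not the actual implementation); lists and the set of supporting transaction identifiers are assumed to be available from the search, and min_fraction corresponds to the user-specified threshold (e.g. 0.75):

# Sketch of the filter described above: an item set is kept only if every item
# occurs in a sufficiently large fraction of the supporting transactions
# (unweighted variant: every supporting transaction counts 1).

def is_spurious(itemset, lists, supporting_tids, min_fraction=0.75):
    """lists[i]: transaction ids containing item i; supporting_tids: ids of the
    transactions that (fault-tolerantly) support the item set."""
    support = len(supporting_tids)
    if support == 0:
        return True
    for item in itemset:
        contained = len(supporting_tids.intersection(lists[item]))
        if contained / support < min_fraction:
            return True      # item too weakly supported -> pseudo/spurious set
    return False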

5 Experimental Evaluation

We implemented the described item set mining approach as a C program, called SODIM (Subset size Occurrence Distribution based Item set Mining), that was essentially derived from an Eclat implementation (which provided the initial setup of the transaction identifier lists). We implemented all measures listed in [32], even though for these measures (in their original form) the JIM algorithm is better suited, because they do not require subset occurrence values beyond dist[k], dist[1], and dist[0]. However, we also implemented the extended support and the extended Jaccard measure (as well as generalizations of all other measures described in [32]), which JIM cannot compute. We also added optional culling of pseudo (or spurious) item sets, thus providing possibilities far surpassing the JIM implementation. This SODIM implementation has been made publicly available under the GNU Lesser (Library) General Public License.6

6 http://www.borgelt.net/sodim.html

In a first set of experiments we tested our implementation on data that was artificially generated with a program7 that was developed to simulate parallel neuronal spike trains (see also below, for example Section 10). We created a transaction database with 100 items and 10000 transactions, in which each item occurs in a transaction with 5% probability (independent items, so co-occurrences are entirely random). Into this database we injected six groups of co-occurring items, which ranged in size from 6 to 10 items and which partially overlapped (some items were contained in two groups). For each group we injected between 20 and 30 co-occurrences (that is, in 20 to 30 transactions the items of the group actually co-occur). In order to compensate for the additional item occurrences due to this, we reduced (for the items in the groups) the occurrence probabilities in the remaining transactions (that is, the transactions in which they did not co-occur) accordingly, so that all items shared the same individual expected frequency. In addition, we removed from each co-occurrence of a group of items one of its items, thus creating the noisy instances of item sets we try to find with the SODIM algorithm. Note that due to this deletion scheme none of the transactions contained all items in a given group.8 As a consequence, no standard frequent item set mining algorithm is able to detect the groups, regardless of the used minimum support threshold.

We then mined this database with SODIM, using a minimum standard support (all items contained) of 0, a minimum extended support of 10 (with a weight of 0.5 for transactions with one missing item) and a minimum fraction of supporting transactions containing each item of 75%. In addition, we restricted the output to maximal item sets (based on standard support), in order to suppress the output of subsets of the injected groups. This experiment was repeated several times with different databases generated in the way described above.9 We observed that the injected groups were always perfectly detected, while only rarely a false positive result, usually with 4 items, was produced.

In a second set of experiments we compared SODIM to the two other cost-based methods reviewed in Section 2, namely RElim [36] and SaM [6]. As a test data set we chose the well-known BMS-Webview-1 data, which describes a web click stream from a leg-care company that no longer exists. This data set has been used in the KDD cup 2000 [24] as well as in many other comparisons of frequent item set mining algorithms. By properly parameterizing RElim and SaM (namely by choosing the same insertion penalty for all items and specifying a corresponding transaction weight threshold to limit the number of insertions), these methods can be made to find exactly the same item sets. We chose two insertion penalties (RElim and SaM) or downweighting factors for missing items

7 http://www.borgelt.net/genpst.html
8 The only case in which groups can be complete is when the co-occurrences of two groups overlap accidentally and this fills (one of) the formerly created gaps. However, this is highly unlikely and we did not observe this case in our experiments.
9 The script used to perform these experiments can be found in the source package of our SODIM implementation at http://www.borgelt.net/sodim.html.

[Figure 5: two panels plotting log(time/s) over absolute support (100–500) for SaM, RElim and SODIM, for one insertion (left) and two insertions (right).]

Fig. 5. Execution times on the BMS-Webview-1 data set. Light colors refer to an insertion penalty factor of 0.25, dark colors to an insertion penalty factor of 0.5.

(SODIM), namely 0.5 and 0.25, and tested with one and two insertions (RElim and SaM) or missing items (SODIM). The results, which were obtained on an Intel Core 2 Quad Q9650 (3GHz) computer with 8 GB main memory running Ubuntu Linux 10.04 (64 bit) and gcc version 4.4.3, are shown in Figure 5. Clearly, SODIM outperforms both SaM and RElim by a large margin, with the exception of the lowest support value for one insertion and a penalty of 0.5, where SODIM is slightly slower than both SaM and RElim. It should be noted, though, that this does not render SaM and RElim useless for fault-tolerant item set mining, because they offer options that SODIM does not, namely the possibility to define item-specific insertion penalties. (SODIM treats all items the same.) On the other hand, SODIM allows for more sophisticated evaluation measures and the removal of pseudo (or spurious) item sets. Hence all three algorithms are useful.
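To make the data generation used in the first set of experiments concrete, here is a rough Python sketch of the injection scheme described above (the actual data were created with the generator at http://www.borgelt.net/genpst.html; this sketch omits the compensation that keeps the expected item frequencies equal, and the parameter defaults are illustrative only):

# Illustrative sketch of the test data generation described in this section.
import random

def generate(num_items=100, num_trans=10000, p=0.05, groups=(), coincidences=25):
    """groups: tuples of item ids that are injected as imperfect co-occurrences
    (one randomly chosen item is deleted from every injected co-occurrence)."""
    trans = [set(i for i in range(num_items) if random.random() < p)
             for _ in range(num_trans)]
    for group in groups:
        for tid in random.sample(range(num_trans), coincidences):
            noisy = set(group)
            noisy.discard(random.choice(list(group)))   # delete one item per occurrence
            trans[tid] |= noisy
    return trans

# Example: two partially overlapping groups of 8 and 10 items.
trans = generate(groups=[tuple(range(0, 8)), tuple(range(6, 16))])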

6 Application to Concept Detection

To demonstrate the practical usefulness of our method, we also applied it to the 2008/2009 Wikipedia Selection for schools10, which is a subset of the English Wikipedia11 with about 5500 articles and more than 200,000 hyperlinks. We used a subset of this data set that does not contain articles belonging to the subjects "Geography", "Countries" or "History", resulting in a subset of about 3,600 articles and more than 65,000 hyperlinks. The excluded subjects do not affect the chemical topic we focus on in our experiment, but contain articles that reference many articles or that are referenced by many articles (such as United States with 2,230 references). Including the mentioned subject areas would lead to an explosion of the number of discovered item sets and thus would make it much more difficult to demonstrate the effects we are interested in. The 2008/2009 Wikipedia Selection for schools describes 118 chemical elements.12 However, there are 157 articles that reference the Chemical element

10 http://schools-wikipedia.org/
11 http://en.wikipedia.org/
12 http://schools-wikipedia.org/wp/l/List_of_elements_by_name.htm

Table 1. Results for different numbers of missing items.

Missing items   Transactions   Chemical elements   Other articles   Not referencing
0               25             24                  1                0
1               47             34                  13               1
2               139            71                  68               3
3               239            85                  154              9

article or are referenced by it, so that simply collecting the referenced or referencing articles does not yield a good extensional representation of this concept. Searching for references to the Chemical element article thus results not only in articles describing chemical elements but also in other articles including Albert Einstein, Extraterrestrial Life, and Universe. Furthermore, there are 17 chemical elements (e.g. palladium) that do not reference the Chemical element article.

In order to better filter articles that provide information about chemical elements, one may try to extend the query with the titles of articles that are frequently co-referenced with the Chemical element article, but are more specific than a reference to/from this article alone. In order to find such co-references, we apply our SODIM algorithm. To do so, we converted each article into a transaction, such that each referenced article is an item in the transaction of the referring article. This resulted in a transaction database with 3,621 transactions. We then ran our SODIM algorithm with a minimum item set size of 5 and a minimum support (all items contained) of 25 in order to find suitable co-references. 29 of the 81 found item sets contain the item Chemical element and thus are candidates for the sets of co-referenced terms we are interested in. From these 29 item sets we chose the following item set for the subsequent experiments: {Oxygen, Electron, Hydrogen, Melting point, Chemical element}. Considering the semantics of the terms in this set, we can expect it to provide a better characterization of the extension of the concept of a chemical element.

In order to illustrate this, we retrieved from our selection of articles those containing the chosen item set, allowing for a varying number of missing items (from this set), which produces different selections of articles that are related to chemical elements. The results are shown in Table 1. The first column contains the allowed number of missing items (out of the five items in the item set stated above) and the second column the number of articles (transactions) that are retrieved under this condition. The third column states how many of these articles are actually about chemical elements, the fourth column how many other articles were retrieved (hence we always have column 3 plus column 4 equals column 2). Most interesting is the last column, which contains the number of discovered chemical elements that do not reference the Chemical element article.

Obviously, the selected item set is very specific in selecting articles about chemical elements, because if it has to be contained as a whole (no missing items), only one article is retrieved that is not about a chemical element. However, its recall properties are not particularly good, since it retrieves only 24 out of the 118 articles about chemical elements. The more missing items are allowed, the

Fig. 6. Distance/similarity matrix of 100 items (item covers) computed with the Dice measure (see Table 3) with an injected item set of size 20 (the darker the gray, the lower the pairwise distance). The data set underlying this distance matrix is depicted as a dot display in the middle diagram of Figure 8.

better the recall gets, though, of course, specificity goes down. However, this is compensated by the fact that some articles about chemical elements are also retrieved that do not reference the Chemical element article, and hence this method provides a better recall than a simple retrieval based on a reference to the Chemical element article (at the price of a somewhat lower specificity). As a consequence, we believe that we can reasonably claim that finding approximate item sets with the SODIM algorithm can help to detect new concepts (we used a known concept only to have a standard to compare to) and to complete missing links and references for existing concepts.
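The retrieval step underlying Table 1 amounts to a simple subset test with a tolerance for missing items; a minimal sketch (assuming each article is represented by the set of article titles it references, here held in a hypothetical variable article_link_sets) is:

# Sketch of the retrieval used for Table 1: select all articles (transactions) that
# contain the chosen item set except for at most `max_missing` of its items.

def retrieve(transactions, itemset, max_missing=0):
    itemset = set(itemset)
    return [tid for tid, refs in enumerate(transactions)
            if len(itemset - refs) <= max_missing]

query = {"Oxygen", "Electron", "Hydrogen", "Melting point", "Chemical element"}
# hits = retrieve(article_link_sets, query, max_missing=1)   # article_link_sets: hypothetical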

7 Measuring Item Cover (Dis)Similarity

The approach described in [32], which we already referred to above, introduced the idea to evaluate item sets by the similarity of the covers of the contained items, where an item cover is the set of identifiers of transactions that contain the item. Under certain conditions (which can reasonably be assumed to hold in the area of parallel spike train analysis we study below), even a mere pairwise analysis can sometimes reveal the existence of strongly correlated sets of items.

As an example, Figure 6 shows an item cover distance matrix for 100 items and a database with 10000 transactions, which was generated with the same program already used for the experiments reported in Section 5.13 Into this data set a single strongly correlated item set of size 20 was injected. Since the darker fields (apart from the diagonal, which represents the distance of items to themselves and thus necessarily represents perfectly similar—namely identical—item covers) exhibit a clear regular structure, the existence of such a group of correlated items is immediately apparent and the members of the group can easily be identified.

In order to actually measure the (dis)similarity of (pairs of) item covers, one can employ any similarity or distance measure for sets or binary vectors

13 http://www.borgelt.net/genpst.html

Table 2. A contingency table for two items A and B. Item

B value 0 (B ∈ / t) 1 0 (A ∈ / t) n00 A 1 (A ∈ t) n10 sum n.0

(B ∈ t) n01 n11 n.1

sum n0. n1. n..

Table 3. Some distance measures used for comparing two item covers.

Hamming [20]                  dHamming      = (n01 + n10) / n..
Jaccard [22], Tanimoto [35]   dJaccard      = (n01 + n10) / (n01 + n10 + n11)
Dice [15], Sørensen [34]      dDice         = (n01 + n10) / (n01 + n10 + 2 n11)
Rogers & Tanimoto [30]        dR&T          = 2 (n01 + n10) / (n00 + 2 (n01 + n10) + n11)
Yule [38]                     dYule         = 2 n01 n10 / (n11 n00 + n01 n10)
χ2 [11]                       dχ2           = 1 − (n11 n00 − n01 n10) / √(n0. n1. n.0 n.1)
Correlation [16]              dCorrelation  = 1/2 − (n11 n.. − n.1 n1.) / (2 √(n0. n1. n.0 n.1))

(since the set of identifiers of supporting transactions can also be represented by a binary vector ranging over all transactions in the database, with entries of value 1 indicating which are the supporting transactions). Recent extensive overviews of such measures include [10, 11]; a selection that can reasonably be generalized to more than two sets or binary vectors can be found in [32]. Technically, all of these measures are computed from 2 × 2 contingency tables like the one shown in Table 2. The fields of such a contingency table count how many transactions contain both items (n11), only the first item (n10), only the second (n01) or neither (n00). The row and column marginals are simply ni. = ni0 + ni1 and n.i = n0i + n1i for i ∈ {0, 1}, while n.. = n0. + n1. = n.0 + n.1 is the total number of transactions in the database to analyze.

In principle, we could use any of the measures listed in [10, 11] for our algorithm.14 However, since considering all of these is clearly infeasible, we decided, after some experimentation, on the subset shown in Table 3. We believe that this subset, though small, still is reasonably representative for the abundance of

14 Note that, since we are considering only pairwise comparisons here, we are less restricted than in [32], where measures referring to n01 and n10 other than through the sum of these quantities are not applicable.

available measures, covering several different fundamental ideas and emphasizing different aspects of the similarity of binary vectors.
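Computationally, all of the measures in Table 3 derive from the counts of Table 2. As an illustration (a generic sketch, not the authors' code), the following computes the pairwise Dice distances of all item covers; the other measures only change the final formula:

# Sketch: pairwise item cover distances from the counts of Table 2, here with the
# Dice measure of Table 3.

def dice_distance_matrix(covers):
    """covers[i]: set of ids of transactions containing item i."""
    m = len(covers)
    d = [[0.0] * m for _ in range(m)]        # diagonal stays 0 (identical covers)
    for i in range(m):
        for j in range(i + 1, m):
            n11 = len(covers[i] & covers[j])
            n10 = len(covers[i]) - n11
            n01 = len(covers[j]) - n11
            denom = n01 + n10 + 2 * n11
            d[i][j] = d[j][i] = (n01 + n10) / denom if denom > 0 else 1.0
    return d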

8 Finding Item Sets with Noise Clustering

In order to assess how well the distance measures listed in Table 3 are able to distinguish between vectors that contain only random noise and vectors that contain possibly relevant correlations, we evaluated them by an outlier detection method that is based on noise clustering. In this evaluation we assume that all items not belonging to a relevant item set are outliers, so that after the removal of outliers only relevant items remain. This approach can be interpreted in two ways: in the first place it can be seen as a preprocessing method that focuses the search towards relevant items and reduces the computational costs of the actual item set finding step by culling the item base on which it has to be executed. This is particularly important if the subsequent step, which actually identifies the item sets, is an enumeration approach, as this has—in the worst case—computational costs that are exponential in the number of items. Secondly, if there are only few relevant sets of items to be detected and their structure is sufficiently clear (as can reasonably be expected, at least under certain conditions, in the area of spike train analysis), the method may already yield the desired item sets. A usable noise-based outlier detection method has been proposed in [29]. The algorithm introduced, in a manner similar to [14], a noise cloud, which has the same distance to every data point (here: item cover). A data point is assigned to the noise cloud if and only if there is no other data point in the data set that is closer to it than the noise cloud. At the beginning of the algorithm the distance to the noise cloud is 0 for all data points and thus all points belong to the noise cloud. Then the distance to the noise cloud is slowly increased. As a consequence, more and more data points fall out of the noise cloud (and may then be assigned to an actual cluster of items if clusters are formed rather than a mere outlier detection and elimination is carried out). Plotting the distance of the noise cloud against the number of points not belonging to the noise cluster leads to diagrams like those shown in Figure 7, which cover different data properties. The default parameters with which the underlying data sets were generated (using a neurobiological spike train simulator) are as stated in the caption, while deviating settings are stated above each diagram. Note that by varying the copy probability r we generate occurrences of item sets that miss some of the items that should be present, since it is our goal to develop a method that can find approximate (or fault-tolerant or fuzzy) item sets. Note also that all diagrams in the upper row are based on data with a copy probability of 1.0, thus generating perfect item sets. In the lower row, however, different smaller values of the copy probability were tested. Analyzing the steps in such curves allows us to draw some conclusions about possible clusters (relevant item sets) in the data. Item covers that have a lot of transactions in common have smaller distances for a properly chosen metric and therefore fall out of the noise cluster earlier than items that only have a

[Figure 7: eight panels (top row: default, t=5000, p=0.01, m=1–20,21–40; bottom row: r=0.8, r=0.66, r=0.50, plus a legend) with one curve per distance measure (Hamming, Jaccard, Dice, Rogers & Tanimoto, Yule, chi-square, correlation); both axes range from 0 to 1.]

Fig. 7. Fraction of items not assigned to the noise cloud (vertical axis) plotted over the distance from the noise cloud (horizontal axis; note that all distance measures have range [0, 1]) for different distance measures. The default parameters of the underlying data sets are: n = 100 items, t = 10000 transactions, p = 0.02 probability of an item occurring in a transaction, m = 1–20 group of items potentially occurring together, c = 0.005 probability of a coincident occurrence event for the group(s) of items, r = 1 probability with which an item is actually copied from the coincident occurrence process. Deviations from these parameters are stated above the diagrams.

few transactions in common. Hence, by clustering items together that have low pairwise distances, we can find relevant item sets, while the noise cloud captures items that are not part of any relevant item sets. The fairly wide plateaus in the diagrams of Figure 7 (at least for most of the distance measures) indicate that this method is quite able to identify items belonging to a relevant set by relying on pairwise distances. Even if the copy probability drops to a value as low as 0.5 (only half of the items of the correlated set are, on average, contained in an occurrence of that set) the difference between the covers of items belonging to a relevant set and those not belonging to one remains clearly discernible. Judging from the width of the plateaus, and thus the clarity of detection, we can conclude that among the tested measures dDice yields the best results for a copy probability of 1. For a copy probability less than 1 dDice still yields very good results, but dYule appears to be somewhat more robust, as it is less heavily affected by the imperfect data. Unfortunately, from a pure outlier-focused application of this approach there arises a problem: even if all items can be detected that most likely belong to relevant item sets, the structure of these sets (which items belong to which set(s)) remains unknown, since data points are only assigned to the noise cluster or not. Figure 7 demonstrates this quite well: for the last diagram in the first row two sets of 20 items each (both with the same parameters) were injected

into a total of 100 items, but there is only one step of 40 items visible in the diagram, rather than two steps of 20 items. The reason for this is, of course, that the size and parameters of the injected item sets were the same, so that they behave the same w.r.t. the noise cloud, even though their pairwise distance structure certainly indicates that not all of these items belong to one set. In order to separate such item sets, one may consider applying a standard clustering algorithm to the non-outlier items to which the above algorithm has reduced the data set, or to apply the original version of the noise-clustering algorithm of [14], with a distance for the noise cluster that can be selected by evaluating the plateaus of the curves shown in Figure 7. However, such an approach still has the disadvantage that the result is based solely on pairwise distances and does not actually check for a true higher-order correlation of the items. In principle, they could still be correlated in smaller, overlapping subsets, which can lead to the same pairwise distances.
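A minimal sketch of the noise-cloud procedure described above (not the implementation used for Figure 7): an item leaves the noise cloud as soon as some other item cover is closer to it than the current noise distance, and plotting the fraction of such items over the noise distance yields curves like those in Figure 7.

# Sketch of the noise-cloud thresholding described in this section.

def noise_curve(dist_matrix, steps=100):
    """dist_matrix: symmetric matrix of pairwise item cover distances in [0, 1].
    Returns (noise distance, fraction of items outside the noise cloud) pairs."""
    m = len(dist_matrix)
    nearest = [min(dist_matrix[i][j] for j in range(m) if j != i) for i in range(m)]
    curve = []
    for s in range(steps + 1):
        delta = s / steps                  # current distance to the noise cloud
        outside = sum(1 for d in nearest if d < delta)
        curve.append((delta, outside / m))
    return curve

def non_noise_items(dist_matrix, delta):
    """Items whose nearest neighbour is closer than the chosen noise distance."""
    m = len(dist_matrix)
    return [i for i in range(m)
            if min(dist_matrix[i][j] for j in range(m) if j != i) < delta]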

9 Sorting with Non-Linear Mappings

To avoid having to test all subsets of items in order to detect true higher-order correlations, we developed an algorithm that is inspired by the tile-finding algorithm of [18]. The basic idea of this algorithm is that relevant item sets form "lines" in a dot-display of the transactional data if the items are properly sorted. An example is shown in Figure 8: while the data set that is depicted in the top diagram contains only random noise (independent items), the diagram in the middle displays data into which a group of 20 correlated items was injected (this is the same data set from which Figure 6 was computed). However, since the items of this set are randomly interspersed with independent items, there are no visual clues by which one can distinguish it reliably from the top diagram. However, if the correlated items are relocated to the bottom of the diagram, as shown in the bottom diagram, clear lines become visible, by which the correlated set of items can be identified. If one also reorders the transactions, so that those in which the items of the correlated group occur together are placed on the very left, the item set becomes visible as a tile in the lower left corner.

Based on this fundamental idea, the main steps of our algorithm consist in finding a proper reordering of the items and then applying a sequence of statistical tests in order to actually identify, in a statistically sound way, significant item sets. Note that in this respect the advantage of reordering the items is that no complete item set enumeration is needed, but a simple linear traversal of the reordered items suffices. This is important in order to reduce the number of necessary statistical tests and thus to reduce the number of false positive results (and generally to mitigate the problem of multiple testing and the loss of control of significance ensuing from it).

In [18] it was proposed to use a concept from linear algebra to find an appropriate sorting of the items. The idea is to compute the symmetric matrix L_S = R_S − S, where S = (s_ij) is the similarity matrix of the item covers and R_S = (r_ii) is a diagonal matrix with r_ii = Σ_j s_ij.

[Figure 8: three dot-displays (horizontal axis: time 0–10s, i.e. the transactions; vertical axis: neurons 1–100, i.e. the items).]

Fig. 8. Dot-displays of two transaction databases (horizontal: transactions, vertical: items) generated with a neurobiological spike train simulator. The top diagram shows independent items, while the two lower diagrams contain a group of 20 correlated items. In the middle diagram, these items are randomly interspersed with independent items, while in the bottom diagram they are sorted into the bottom rows.

The elements s_ij of the matrix S may be computed, for example, as 1 − d_ij, where d_ij is the distance of the covers of the items i and j, for which all measures shown in Table 3 can be used. In principle, however, all similarity and distance measures listed in [11] may be used, at least if properly adjusted. From this matrix L_S the so-called Fiedler vector is computed, which is the eigenvector corresponding to the smallest nonzero eigenvalue. The items are then reordered according to the coordinate values that are assigned to them in the Fiedler vector. In [18] it is argued that this is a good choice, because the Fiedler vector minimizes the stress function x^T L_S x = ½ Σ_{i,j} s_ij · (x_i − x_j)² w.r.t. the constraints x^T e = 0, where e = (1, . . . , 1), and x^T x = 1 (as can be shown fairly easily).

However, experiments we carried out with this approach showed that even when there was only one set of correlated items, reordering them according to the Fiedler vector did not sort the members of this set properly together, thus leading to highly unsatisfactory results. We therefore looked for alternatives, and since we want to place similar item covers (or item covers having a small distance) close to each other, the Sammon projection [31] suggests itself. This algorithm maps data points from a high-dimensional space onto a low-dimensional space, usually a plane, with the goal to preserve the pairwise distances between the data points as accurately as possible. Formally, it tries to minimize the error function Σ_{i<j} (d_ij − d*_ij)² / d_ij
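A generic sketch of such a one-dimensional Sammon mapping by gradient descent is given below (a textbook-style illustration, not the authors' implementation; the resulting coordinates are only used to order the items for the subsequent linear traversal, and the parameter names are chosen for this example).

# Generic sketch of a one-dimensional Sammon mapping by gradient descent; the
# resulting coordinates y[i] induce the item order used for the linear traversal.
import random

def sammon_1d(d, iterations=500, rate=0.1):
    """d: symmetric matrix of pairwise item cover distances (d[i][j] > 0 for i != j)."""
    m = len(d)
    y = [random.uniform(0.0, 1.0) for _ in range(m)]   # random initial coordinates
    c = sum(d[i][j] for i in range(m) for j in range(i + 1, m)) or 1.0  # normalization
    for _ in range(iterations):
        grad = [0.0] * m
        for i in range(m):
            for j in range(m):
                if i == j:
                    continue
                dstar = abs(y[i] - y[j]) or 1e-12       # mapped (projected) distance
                # derivative of (d_ij - d*_ij)^2 / d_ij with respect to y[i]
                grad[i] += -2.0 * (d[i][j] - dstar) / d[i][j] * (y[i] - y[j]) / dstar
        for i in range(m):
            y[i] -= rate * grad[i] / c
    return sorted(range(m), key=lambda i: y[i])         # item order from coordinates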