E-book, English, Volume 32, 227 pages
Series: Advances in Database Systems
Zezula / Amato / Dohnal Similarity Search
1st edition, 2006
ISBN: 978-0-387-29151-2
Publisher: Springer US
Format: PDF
Copy protection: 1 - PDF Watermark
The Metric Space Approach
The area of similarity searching is a very hot topic for both research and commercial applications. Current data processing applications use data with considerably less structure and much less precise queries than traditional database systems. Examples are multimedia data like images or videos that offer query-by-example search, product catalogs that provide users with preference-based search, scientific data records from observations or experimental analyses such as biochemical and medical data, or XML documents that come from heterogeneous data sources on the Web or in intranets and thus do not exhibit a global schema. Such data can neither be ordered in a canonical manner nor meaningfully searched by precise database queries that would return exact matches. This novel situation is what has given rise to similarity searching, also referred to as content-based or similarity retrieval. The most general approach to similarity search, still allowing construction of index structures, is modeled in metric space. In this book, Prof. Zezula and his co-authors provide the first monograph on this topic, describing its theoretical background as well as the practical search tools of this innovative technology.
Authors/Editors
More Info & Material
1;Contents;7
2;Foreword;13
3;Preface;15
4;Acknowledgments;17
5;PART I METRIC SEARCHING IN A NUTSHELL;18
5.1;Chapter 1 FOUNDATIONS OF METRIC SPACE SEARCHING;22
5.1.1;1. The Distance Searching Problem;23
5.1.2;2. The Metric Space;25
5.1.3;3. Distance Measures;26
5.1.3.1;3.1 Minkowski Distances;27
5.1.3.2;3.2 Quadratic Form Distance;28
5.1.3.3;3.3 Edit Distance;29
5.1.3.4;3.4 Tree Edit Distance;30
5.1.3.5;3.5 Jaccard's Coefficient;30
5.1.3.6;3.6 Hausdorff Distance;31
5.1.3.7;3.7 Time Complexity;31
5.1.4;4. Similarity Queries;32
5.1.4.1;4.1 Range Query;32
5.1.4.2;4.2 Nearest Neighbor Query;33
5.1.4.3;4.3 Reverse Nearest Neighbor Query;34
5.1.4.4;4.4 Similarity Join;34
5.1.4.5;4.5 Combinations of Queries;35
5.1.4.6;4.6 Complex Similarity Queries;35
5.1.5;5. Basic Partitioning Principles;37
5.1.5.1;5.1 Ball Partitioning;37
5.1.5.2;5.2 Generalized Hyperplane Partitioning;38
5.1.5.3;5.3 Excluded Middle Partitioning;38
5.1.6;6. Principles of Similarity Query Execution;39
5.1.6.1;6.1 Basic Strategies;39
5.1.6.2;6.2 Incremental Similarity Search;42
5.1.7;7. Policies for Avoiding Distance Computations;43
5.1.7.1;7.1 Explanatory Example;44
5.1.7.2;7.2 Object-Pivot Distance Constraint;45
5.1.7.3;7.3 Range-Pivot Distance Constraint;47
5.1.7.4;7.4 Pivot-Pivot Distance Constraint;48
5.1.7.5;7.5 Double-Pivot Distance Constraint;50
5.1.7.6;7.6 Pivot Filtering;51
5.1.8;8. Metric Space Transformations;52
5.1.8.1;8.1 Metric Hierarchies;53
5.1.8.1.1;8.1.1 Lower-Bounding Functions;53
5.1.8.2;8.2 User-Defined Metric Functions;55
5.1.8.2.1;8.2.1 Searching Using Lower-Bounding Functions;55
5.1.8.3;8.3 Embedding Metric Space;56
5.1.8.3.1;8.3.1 Embedding Examples;56
5.1.8.3.2;8.3.2 Reducing Dimensionality;57
5.1.9;9. Approximate Similarity Search;58
5.1.9.1;9.1 Principles;58
5.1.9.2;9.2 Generic Algorithms;61
5.1.9.3;9.3 Measures of Performance;63
5.1.9.3.1;9.3.1 Improvement in Efficiency;63
5.1.9.3.2;9.3.2 Precision and Recall;63
5.1.9.3.3;9.3.3 Relative Error on Distances;65
5.1.9.3.4;9.3.4 Position Error;66
5.1.10;10. Advanced Issues;67
5.1.10.1;10.1 Statistics on Metric Datasets;68
5.1.10.1.1;10.1.1 Distribution and Density Functions;68
5.1.10.1.2;10.1.2 Distance Distribution and Density;69
5.1.10.1.3;10.1.3 Homogeneity of Viewpoints;71
5.1.10.2;10.2 Proximity of Ball Regions;72
5.1.10.3;10.3 Performance Prediction;75
5.1.10.4;10.4 Tree Quality Measures;77
5.1.10.5;10.5 Choosing Reference Points;80
5.2;Chapter 2 SURVEY OF EXISTING APPROACHES;84
5.2.1;1. Ball Partitioning Methods;84
5.2.1.1;1.1 Burkhard-Keller Tree;85
5.2.1.2;1.2 Fixed Queries Tree;86
5.2.1.3;1.3 Fixed Queries Array;87
5.2.1.4;1.4 Vantage Point Tree;89
5.2.1.5;1.5 Excluded Middle Vantage Point Forest;92
5.2.2;2. Generalized Hyperplane Partitioning Approaches;93
5.2.2.1;2.1 Bisector Tree;93
5.2.2.2;2.2 Generalized Hyperplane Tree;94
5.2.3;3. Exploiting Pre-Computed Distances;95
5.2.3.1;3.1 AESA;95
5.2.3.2;3.2 Linear AESA;96
5.2.3.3;3.3 Other Methods;97
5.2.4;4. Hybrid Indexing Approaches;98
5.2.4.1;4.1 Multi Vantage Point Tree;98
5.2.4.2;4.2 Geometric Near-neighbor Access Tree;99
5.2.4.3;4.3 Spatial Approximation Tree;102
5.2.4.4;4.4 M-tree;104
5.2.4.5;4.5 Similarity Hashing;105
5.2.5;5. Approximate Similarity Search;106
5.2.5.1;5.1 Exploiting Space Transformations;106
5.2.5.2;5.2 Approximate Nearest Neighbors with BBD Trees;107
5.2.5.3;5.3 Angle Property Technique;109
5.2.5.4;5.4 Clustering for Indexing;111
5.2.5.5;5.5 Vector Quantization Index;112
5.2.5.6;5.6 Buoy Indexing;114
5.2.5.7;5.7 Hierarchical Decomposition of Metric Spaces;114
5.2.5.7.1;5.7.1 Relative Error Approximation;115
5.2.5.7.2;5.7.2 Good Fraction Approximation;115
5.2.5.7.3;5.7.3 Small Chance Improvement Approximation;115
5.2.5.7.4;5.7.4 Proximity-Based Approximation;116
5.2.5.7.5;5.7.5 PAC Nearest Neighbor Search;116
6;PART II METRIC SEARCHING IN LARGE COLLECTIONS OF DATA;118
6.1;Chapter 3 CENTRALIZED INDEX STRUCTURES;122
6.1.1;1. M-tree Family;122
6.1.1.1;1.1 The M-tree;122
6.1.1.2;1.2 Bulk-Loading Algorithm of M-tree;126
6.1.1.3;1.3 Multi-Way Insertion Algorithm;129
6.1.1.4;1.4 The Slim Tree;130
6.1.1.4.1;1.4.1 Slim-Down Algorithm;131
6.1.1.4.2;1.4.2 Generalized Slim-Down Algorithm;133
6.1.1.5;1.5 Pivoting M-tree;135
6.1.1.6;1.6 The M+-tree;138
6.1.1.7;1.7 The M2-tree;141
6.1.2;2. Hash-Based Metric Indexing;142
6.1.2.1;2.1 The D-index;143
6.1.2.1.1;2.1.1 Insertion and Search Strategies;146
6.1.2.2;2.2 The eD-index;148
6.1.2.2.1;2.2.1 Similarity Self-Join Algorithm with eD-index;150
6.1.3;3. Performance Trials;153
6.1.3.1;3.1 Datasets and Distance Measures;154
6.1.3.2;3.2 Performance Comparison;155
6.1.3.3;3.3 Different Query Types;157
6.1.3.4;3.4 Scalability;158
6.2;Chapter 4 APPROXIMATE SIMILARITY SEARCH;162
6.2.1;1. Relative Error Approximation;162
6.2.2;2. Good Fraction Approximation;165
6.2.3;3. Small Chance Improvement Approximation;167
6.2.4;4. Proximity-Based Approximation;169
6.2.5;5. PAC Nearest Neighbor Searching;170
6.2.6;6. Performance Trials;171
6.2.6.1;6.1 Range Queries;172
6.2.6.2;6.2 Nearest Neighbors Queries;173
6.2.6.3;6.3 Global Considerations;176
6.3;Chapter 5 PARALLEL AND DISTRIBUTED INDEXES;178
6.3.1;1. Preliminaries;178
6.3.1.1;1.1 Parallel Computing;179
6.3.1.2;1.2 Distributed Computing;180
6.3.1.2.1;1.2.1 Scalable and Distributed Data Structures;180
6.3.1.2.2;1.2.2 Peer-to-Peer Data Networks;181
6.3.2;2. Processing M-trees with Parallel Resources;181
6.3.2.1;2.1 CPU Parallelism;182
6.3.2.2;2.2 I/O Parallelism;182
6.3.2.3;2.3 Object Declustering in M-trees;184
6.3.3;3. Scalable Distributed Similarity Search Structure;184
6.3.3.1;3.1 Architecture;185
6.3.3.2;3.2 Address Search Tree;186
6.3.3.3;3.3 Storage Management;186
6.3.3.3.1;3.3.1 Bucket Splitting;187
6.3.3.3.2;3.3.2 Choosing Pivots;188
6.3.3.4;3.4 Insertion of Objects;188
6.3.3.5;3.5 Range Search;189
6.3.3.6;3.6 Nearest Neighbor Search;190
6.3.3.7;3.7 Deletions and Updates of Objects;191
6.3.3.8;3.8 Image Adjustment;192
6.3.3.9;3.9 Logarithmic Replication Strategy;194
6.3.3.10;3.10 Joining the Peer-to-Peer Network;195
6.3.3.11;3.11 Leaving the Peer-to-Peer Network;195
6.3.4;4. Performance Trials;196
6.3.4.1;4.1 Datasets and Computing Infrastructure;197
6.3.4.2;4.2 Performance of Similarity Queries;197
6.3.4.2.1;4.2.1 Global Costs;198
6.3.4.2.2;4.2.3 Comparison of Search Algorithms;205
6.3.4.3;4.3 Data Volume Scalability;206
7;Concluding Summary;210
8;References;214
9;Author Index;228
10;Index;232
11;Abbreviations;236
Chapter 3 CENTRALIZED INDEX STRUCTURES (p. 105-106)
Most existing search structures have been designed to run on a single computer. Let us call them centralized structures. They are built with different assumptions about the type of distance function (discrete vs. continuous), the form of query (range, nearest neighbor, etc.), and the temporal properties (static or dynamic) of the data to be organized. Although many index structures have been proposed as main memory structures, there are several indexes which organize data using disk storage to allow the processing of a large volume of data. In what follows, we focus on two basic approaches which store objects in secondary memory. Specifically, we discuss tree-based structures and methods which employ hashing (i.e., key-to-address transformation) principles.
1. M-tree Family
[Ciaccia et al., 1997b] have proposed a dynamic organization, called the M-tree, which can handle data files that vary dynamically in size, i.e., in cases when insertions and deletions of objects are frequent. In contrast to other metric tree-based structures, the M-tree is built bottom-up and maintains the same length for all tree paths because the tree is balanced. This paradigm has become very popular and many researchers have developed extensions of the M-tree storage structure with the main objective of increasing search efficiency, sometimes conditioned by specific application requirements. We start with the original idea of the M-tree, then describe its most important extensions.
1.1 The M-tree
Most of the indexing methods described in Chapter 2 are either static, unbalanced, or both. They are not very suitable for dynamic environments where data is subject to continual change, nor for large data repositories where disk-based techniques are necessary. The M-tree is, by nature, designed as a dynamic and balanced index structure capable of organizing data stored on a disk. By building the tree in a bottom-up fashion from its leaves to its root, the M-tree shares some similarities with R-trees [Guttman, 1984] and B-trees [Comer, 1979].
This concept results in a balanced tree structure independent of the number of insertions or deletions and has a positive impact on query execution. In general, the M-tree behaves like the R-tree. All objects are stored in (or referenced from) leaf nodes while internal nodes keep pointers to nodes at the next level, together with additional information about their subtrees. Recall that R-trees store minimum bounding rectangles in non-leaf nodes that cover their subtrees. In general metric spaces, we cannot define such bounding rectangles because a coordinate system is lacking.
Thus M-trees use an object called a pivot, together with a covering radius, to form a bounding ball region. In the M-tree, pivots play a role similar to that in the GNAT access structure [Brin, 1995], but unlike in GNAT, all objects are stored in leaves. Because pivots are pre-selected database objects, the same object may be present several times in the M-tree: once in a leaf node, and once or several times in internal nodes as a pivot.
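The node layout and the pruning it enables can be sketched as follows. This is a minimal illustration, not the book's implementation: the node class, the 2-D Euclidean distance standing in for an arbitrary metric, and all names are hypothetical. The key point it demonstrates is the ball-region test: a subtree whose pivot p has covering radius r_c can be skipped for a range query (q, r) whenever d(q, p) - r_c > r, by the triangle inequality.

```python
import math


class MTreeNode:
    """A simplified M-tree node: a pivot object plus a covering radius.

    Internal nodes keep child nodes; leaf nodes keep the data objects.
    The pivot of an internal node is itself a database object, so it may
    also appear in some leaf.
    """

    def __init__(self, pivot, radius, children=None, objects=None):
        self.pivot = pivot                  # routing object of this subtree
        self.radius = radius                # covering radius of the ball region
        self.children = children or []      # non-empty for internal nodes
        self.objects = objects or []        # non-empty for leaf nodes


def dist(a, b):
    """Euclidean distance on 2-D points (stands in for any metric d)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def range_query(node, q, r):
    """Return all objects o with dist(q, o) <= r.

    A child subtree is visited only if its covering ball can intersect
    the query ball, i.e. dist(q, child.pivot) - child.radius <= r.
    """
    results = []
    for o in node.objects:                  # leaf level: check objects directly
        if dist(q, o) <= r:
            results.append(o)
    for child in node.children:
        # Triangle-inequality pruning: no object inside the child's ball
        # can be closer to q than dist(q, pivot) - radius.
        if dist(q, child.pivot) - child.radius <= r:
            results.extend(range_query(child, q, r))
    return results
```

For example, with two leaves centered at (0, 0) and (10, 10) under one root, a query ball of radius 1.0 around (0.5, 0.5) descends only into the first leaf; the second is pruned without computing any distances to its stored objects, which is exactly the saving the covering radius buys.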