Similarity joins in relational database systems
Main authors: | Augsten, Nikolaus ; Böhlen, Michael |
---|---|
Format: | Book |
Language: | English |
Published: | San Rafael, Calif.: Morgan & Claypool Publishers, 2014 |
Series: | Synthesis lectures on data management ; 38 |
Subjects: | |
Online access: | Blurb Table of contents |
Description: | XVII, 106 p., ill. |
ISBN: | 9781627050289 |
Internal format
MARC
LEADER | 00000nam a2200000 cb4500 | ||
---|---|---|---|
001 | BV041791539 | ||
003 | DE-604 | ||
005 | 20140414 | ||
007 | t | ||
008 | 140410s2014 a||| |||| 00||| eng d | ||
020 | |a 9781627050289 |9 978-162-705-028-9 | ||
035 | |a (OCoLC)879398316 | ||
035 | |a (DE-599)BVBBV041791539 | ||
040 | |a DE-604 |b ger | ||
041 | 0 | |a eng | |
049 | |a DE-739 | ||
084 | |a ST 270 |0 (DE-625)143638: |2 rvk | ||
100 | 1 | |a Augsten, Nikolaus |e Verfasser |4 aut | |
245 | 1 | 0 | |a Similarity joins in relational database systems |c Nikolaus Augsten ; Michael H. Böhlen |
264 | 1 | |a San Rafael, Calif. |b Morgan & Claypool Publishers |c 2014 | |
300 | |a XVII, 106 S. |b Ill. | ||
336 | |b txt |2 rdacontent | ||
337 | |b n |2 rdamedia | ||
338 | |b nc |2 rdacarrier | ||
490 | 1 | |a Synthesis lectures on data management |v 38 | |
600 | 3 | 4 | |a Electronic books |
650 | 4 | |a Data mining | |
650 | 4 | |a Database management | |
650 | 4 | |a Information storage and retrieval systems | |
650 | 0 | 7 | |a Abstand |0 (DE-588)4228463-6 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Relationale Datenbank |0 (DE-588)4049358-1 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Abfrageverarbeitung |0 (DE-588)4378490-2 |2 gnd |9 rswk-swf |
689 | 0 | 0 | |a Relationale Datenbank |0 (DE-588)4049358-1 |D s |
689 | 0 | 1 | |a Abfrageverarbeitung |0 (DE-588)4378490-2 |D s |
689 | 0 | 2 | |a Abstand |0 (DE-588)4228463-6 |D s |
689 | 0 | |5 DE-604 | |
700 | 1 | |a Böhlen, Michael |d 1964- |e Verfasser |0 (DE-588)121260542 |4 aut | |
830 | 0 | |a Synthesis lectures on data management |v 38 |w (DE-604)BV036766043 |9 38 | |
856 | 4 | 2 | |m Digitalisierung UB Passau - ADAM Catalogue Enrichment |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=027237145&sequence=000003&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Klappentext |
856 | 4 | 2 | |m Digitalisierung UB Passau - ADAM Catalogue Enrichment |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=027237145&sequence=000004&line_number=0002&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis |
999 | |a oai:aleph.bib-bvb.de:BVB01-027237145 |
Record in the search index
_version_ | 1804152108769869824 |
---|---|
adam_text | Similarity Joins in Relational Database Systems
Nikolaus Augsten, University of Salzburg
Michael H. Böhlen, University of Zurich
State-of-the-art database systems manage and process a variety of complex objects, including strings and trees. For such objects, equality comparisons are often not meaningful and must be replaced by similarity comparisons. This book describes the concepts and techniques for incorporating similarity into database systems. We start out by discussing the properties of strings and trees, and identify the edit distance as the de facto standard for comparing complex objects. Since the edit distance is computationally expensive, token-based distances have been introduced to speed up edit distance computations. The basic idea is to decompose complex objects into sets of tokens that can be compared efficiently. Token-based distances are used to compute an approximation of the edit distance and to prune expensive edit distance calculations.
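The relationship between the two kinds of distance can be illustrated with a small Python sketch. It is not taken from the book: the function names, the default q = 2, and the padding character `#` are assumptions made for illustration. The pruning test uses the commonly cited q-gram count bound (two strings within edit distance k share at least max(|s|, |t|) + q - 1 - k·q padded q-grams), which may differ in detail from the book's formulation.

```python
from collections import Counter


def edit_distance(s: str, t: str) -> int:
    """Unit-cost string edit distance (insert, delete, rename) by dynamic programming."""
    prev = list(range(len(t) + 1))          # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]                          # deleting the first i characters of s
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # rename a to b, or match
        prev = curr
    return prev[-1]


def qgrams(s: str, q: int = 2, pad: str = "#") -> Counter:
    """Bag of q-grams of s, padded with q - 1 pad characters on both sides (assumed scheme)."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def qgram_overlap(s: str, t: str, q: int = 2) -> int:
    """Size of the bag intersection of the q-grams of s and t."""
    return sum((qgrams(s, q) & qgrams(t, q)).values())


def passes_count_filter(s: str, t: str, k: int, q: int = 2) -> bool:
    """False only if the pair cannot be within edit distance k (count bound, see lead-in)."""
    return qgram_overlap(s, t, q) >= max(len(s), len(t)) + q - 1 - k * q


if __name__ == "__main__":
    for s, t in [("abc", "abd"), ("abcdef", "uvwxyz")]:
        if passes_count_filter(s, t, k=1):
            print(s, t, "candidate, edit distance =", edit_distance(s, t))
        else:
            print(s, t, "pruned without computing the edit distance")
```

The cheap bag comparison never rejects a pair that is within the threshold, so the quadratic-time edit distance only has to be evaluated on the pairs that survive.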
A key observation when computing similarity joins is that many of the object pairs for which the similarity is computed are very different from each other. Filters exploit this property to improve the performance of similarity joins. A filter preprocesses the input data sets and produces a set of candidate pairs. The distance function is evaluated on the candidate pairs only. We describe the essential query processing techniques for filters based on lower and upper bounds. For token equality joins we describe prefix, size, positional, and partitioning filters, which can be used to avoid the computation of small intersections that are not needed since the similarity would be too low.
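The filter-and-verify pattern described above can be sketched with a prefix filter for overlap similarity. This is an illustrative sketch, not the book's algorithm: the function name `prefix_filter_join`, the record layout, and the choice of lexicographic order as the global token order are assumptions. It relies on the well-known property that two token sets sharing at least t tokens must also share a token among their first |r| - t + 1 tokens under a common global order.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


def prefix_filter_join(records: Dict[str, Set[str]], t: int) -> Set[Tuple[str, str]]:
    """Self-join: pairs of record ids whose token sets share at least t tokens."""
    index = defaultdict(set)   # token -> ids of earlier records with that token in their prefix
    candidates = set()

    # Filter step: probe and extend an inverted index over prefixes only.
    for rid in sorted(records):
        tokens = sorted(records[rid])                      # global order: lexicographic (assumption)
        prefix = tokens[:max(len(tokens) - t + 1, 0)]      # empty if the set has fewer than t tokens
        for tok in prefix:
            for other in index[tok]:
                candidates.add((other, rid))               # other was indexed earlier, so other < rid
            index[tok].add(rid)

    # Verification step: the exact overlap is computed on candidate pairs only.
    return {(a, b) for a, b in candidates if len(records[a] & records[b]) >= t}


if __name__ == "__main__":
    data = {
        "a": {"x", "y", "z"},
        "b": {"w", "y", "z"},
        "c": {"p", "q", "r"},
    }
    print(prefix_filter_join(data, t=2))
```

In this toy input only the pair (a, b) becomes a candidate; the pairs involving c are never compared because their prefixes share no token.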
Preface
Acknowledgments
1 Introduction
  1.1 Applications of Similarity Queries
  1.2 Edit-Based Similarity Measures
  1.3 Token-Based Similarity Measures
2 Data Types
  2.1 Strings
  2.2 Trees
3 Edit-Based Distances
  3.1 String Edit Distance
    3.1.1 Definition of the String Edit Distance
    3.1.2 Computation of the String Edit Distance
  3.2 Tree Edit Distance
    3.2.1 Definition of the Tree Edit Distance
    3.2.2 Computation of the Tree Edit Distance
    3.2.3 Constrained Tree Edit Distance
    3.2.4 Unordered Tree Edit Distance
  3.3 Further Readings
4 Token-Based Distances
  4.1 Sets and Bags
    4.1.1 Counting Approach
    4.1.2 Frequency Approach
  4.2 Similarity Measures for Sets and Bags
    4.2.1 Overlap Similarity
    4.2.2 Jaccard Similarity
    4.2.3 Dice Similarity
    4.2.4 Converting Threshold Constraints
  4.3 String Tokens
    4.3.1 q-Gram Tokens
  4.4 Tokens for Ordered Trees
    4.4.1 Overview of Ordered Tree Tokens
    4.4.2 The pq-Gram Distance
    4.4.3 An Algorithm for the pq-Gram Index
    4.4.4 Relational Implementation
  4.5 Tokens for Unordered Trees
    4.5.1 Overview of Unordered Tree Tokens
    4.5.2 Desired Properties for Unordered Tree Decompositions
    4.5.3 The Windowed pq-Gram Distance
    4.5.4 Properties of Windowed pq-Grams
    4.5.5 Building the Windowed pq-Gram Index
  4.6 Discussion: Properties of Tree Tokens
  4.7 Further Readings
5 Query Processing Techniques
  5.1 Filters
  5.2 Lower and Upper Bounds
  5.3 String Distance Bounds
    5.3.1 Length Filter
    5.3.2 Count Filter
    5.3.3 Positional Count Filter
    5.3.4 Using String Filters in a Relational Database
  5.4 Tree Distance Bounds
    5.4.1 Size Lower Bound
    5.4.2 Intersection Lower Bound
    5.4.3 Traversal String Lower Bound
    5.4.4 pq-Gram Lower Bound
    5.4.5 Binary Branch Lower Bound
    5.4.6 Constrained Edit Distance Upper Bound
  5.5 Further Readings
6 Filters for Token Equality Joins
  6.1 Token Equality Join - Avoiding Empty Intersections
  6.2 Prefix Filter - Avoiding Small Intersections
    6.2.1 Prefix Filter for Overlap Similarity
    6.2.2 Prefix Filter for Jaccard Similarity
    6.2.3 Effectiveness of Prefix Filtering
  6.3 Size Filter
  6.4 Positional Filter
  6.5 Partitioning Filter
  6.6 Further Readings
7 Conclusion
Bibliography
Authors' Biographies
Index
|
any_adam_object | 1 |
author | Augsten, Nikolaus Böhlen, Michael 1964- |
author_GND | (DE-588)121260542 |
author_facet | Augsten, Nikolaus Böhlen, Michael 1964- |
author_role | aut aut |
author_sort | Augsten, Nikolaus |
author_variant | n a na m b mb |
building | Verbundindex |
bvnumber | BV041791539 |
classification_rvk | ST 270 |
ctrlnum | (OCoLC)879398316 (DE-599)BVBBV041791539 |
discipline | Informatik |
format | Book |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>02116nam a2200457 cb4500</leader><controlfield tag="001">BV041791539</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20140414 </controlfield><controlfield tag="007">t</controlfield><controlfield tag="008">140410s2014 a||| |||| 00||| eng d</controlfield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9781627050289</subfield><subfield code="9">978-162-705-028-9</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)879398316</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)BVBBV041791539</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-739</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 270</subfield><subfield code="0">(DE-625)143638:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Augsten, Nikolaus</subfield><subfield code="e">Verfasser</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Similarity joins in relational database systems</subfield><subfield code="c">Nikolaus Augsten ; Michael H. Böhlen</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">San Rafael, Calif.</subfield><subfield code="b">Morgan &amp; Claypool Publishers</subfield><subfield code="c">2014</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">XVII, 106 S.</subfield><subfield code="b">Ill.</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="490" ind1="1" ind2=" "><subfield code="a">Synthesis lectures on data management</subfield><subfield code="v">38</subfield></datafield><datafield tag="600" ind1="3" ind2="4"><subfield code="a">Electronic books</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Data mining</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Database management</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Information storage and retrieval systems</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Abstand</subfield><subfield code="0">(DE-588)4228463-6</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Relationale Datenbank</subfield><subfield code="0">(DE-588)4049358-1</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Abfrageverarbeitung</subfield><subfield code="0">(DE-588)4378490-2</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">Relationale Datenbank</subfield><subfield code="0">(DE-588)4049358-1</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="1"><subfield code="a">Abfrageverarbeitung</subfield><subfield code="0">(DE-588)4378490-2</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="2"><subfield code="a">Abstand</subfield><subfield code="0">(DE-588)4228463-6</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" "><subfield code="5">DE-604</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Böhlen, Michael</subfield><subfield code="d">1964-</subfield><subfield code="e">Verfasser</subfield><subfield code="0">(DE-588)121260542</subfield><subfield code="4">aut</subfield></datafield><datafield tag="830" ind1=" " ind2="0"><subfield code="a">Synthesis lectures on data management</subfield><subfield code="v">38</subfield><subfield code="w">(DE-604)BV036766043</subfield><subfield code="9">38</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">Digitalisierung UB Passau - ADAM Catalogue Enrichment</subfield><subfield code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&amp;doc_library=BVB01&amp;local_base=BVB01&amp;doc_number=027237145&amp;sequence=000003&amp;line_number=0001&amp;func_code=DB_RECORDS&amp;service_type=MEDIA</subfield><subfield code="3">Klappentext</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">Digitalisierung UB Passau - ADAM Catalogue Enrichment</subfield><subfield code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&amp;doc_library=BVB01&amp;local_base=BVB01&amp;doc_number=027237145&amp;sequence=000004&amp;line_number=0002&amp;func_code=DB_RECORDS&amp;service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="999" ind1=" " ind2=" "><subfield code="a">oai:aleph.bib-bvb.de:BVB01-027237145</subfield></datafield></record></collection> |
id | DE-604.BV041791539 |
illustrated | Illustrated |
indexdate | 2024-07-10T01:05:28Z |
institution | BVB |
isbn | 9781627050289 |
language | English |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-027237145 |
oclc_num | 879398316 |
open_access_boolean | |
owner | DE-739 |
owner_facet | DE-739 |
physical | XVII, 106 S. Ill. |
publishDate | 2014 |
publishDateSearch | 2014 |
publishDateSort | 2014 |
publisher | Morgan & Claypool Publishers |
record_format | marc |
series | Synthesis lectures on data management |
series2 | Synthesis lectures on data management |
spelling | Augsten, Nikolaus Verfasser aut Similarity joins in relational database systems Nikolaus Augsten ; Michael H. Böhlen San Rafael, Calif. Morgan & Claypool Publishers 2014 XVII, 106 S. Ill. txt rdacontent n rdamedia nc rdacarrier Synthesis lectures on data management 38 Electronic books Data mining Database management Information storage and retrieval systems Abstand (DE-588)4228463-6 gnd rswk-swf Relationale Datenbank (DE-588)4049358-1 gnd rswk-swf Abfrageverarbeitung (DE-588)4378490-2 gnd rswk-swf Relationale Datenbank (DE-588)4049358-1 s Abfrageverarbeitung (DE-588)4378490-2 s Abstand (DE-588)4228463-6 s DE-604 Böhlen, Michael 1964- Verfasser (DE-588)121260542 aut Synthesis lectures on data management 38 (DE-604)BV036766043 38 Digitalisierung UB Passau - ADAM Catalogue Enrichment application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=027237145&sequence=000003&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA Klappentext Digitalisierung UB Passau - ADAM Catalogue Enrichment application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=027237145&sequence=000004&line_number=0002&func_code=DB_RECORDS&service_type=MEDIA Inhaltsverzeichnis |
spellingShingle | Augsten, Nikolaus Böhlen, Michael 1964- Similarity joins in relational database systems Synthesis lectures on data management Electronic books Data mining Database management Information storage and retrieval systems Abstand (DE-588)4228463-6 gnd Relationale Datenbank (DE-588)4049358-1 gnd Abfrageverarbeitung (DE-588)4378490-2 gnd |
subject_GND | (DE-588)4228463-6 (DE-588)4049358-1 (DE-588)4378490-2 |
title | Similarity joins in relational database systems |
title_auth | Similarity joins in relational database systems |
title_exact_search | Similarity joins in relational database systems |
title_full | Similarity joins in relational database systems Nikolaus Augsten ; Michael H. Böhlen |
title_fullStr | Similarity joins in relational database systems Nikolaus Augsten ; Michael H. Böhlen |
title_full_unstemmed | Similarity joins in relational database systems Nikolaus Augsten ; Michael H. Böhlen |
title_short | Similarity joins in relational database systems |
title_sort | similarity joins in relational database systems |
topic | Electronic books Data mining Database management Information storage and retrieval systems Abstand (DE-588)4228463-6 gnd Relationale Datenbank (DE-588)4049358-1 gnd Abfrageverarbeitung (DE-588)4378490-2 gnd |
topic_facet | Electronic books Data mining Database management Information storage and retrieval systems Abstand Relationale Datenbank Abfrageverarbeitung |
url | http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=027237145&sequence=000003&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=027237145&sequence=000004&line_number=0002&func_code=DB_RECORDS&service_type=MEDIA |
volume_link | (DE-604)BV036766043 |
work_keys_str_mv | AT augstennikolaus similarityjoinsinrelationaldatabasesystems AT bohlenmichael similarityjoinsinrelationaldatabasesystems |