Duplicate detection in XML data:
Main author: | Weis, Melanie |
---|---|
Format: | Thesis Book |
Language: | English |
Published: | Duisburg [u.a.] WiKu 2008 |
Subjects: | Erkennung; Datensatz; Ähnlichkeitsmaß; Großes Datenbanksystem; XML; Dublette |
Online access: | Table of contents |
Description: | XIV p., p. 15 - 229, diagrams, 21 cm |
ISBN: | 9783865532633 |
Internal format
MARC
LEADER | 00000nam a2200000zc 4500 | ||
---|---|---|---|
001 | BV023804956 | ||
003 | DE-604 | ||
005 | 20090426000000.0 | ||
007 | t | ||
008 | 090217s2008 d||| m||| 00||| eng d | ||
020 | |a 9783865532633 |9 978-3-86553-263-3 | ||
035 | |a (OCoLC)916005196 | ||
035 | |a (DE-599)BVBBV023804956 | ||
040 | |a DE-604 |b ger | ||
041 | 0 | |a eng | |
049 | |a DE-634 |a DE-11 | ||
084 | |a ST 250 |0 (DE-625)143626: |2 rvk | ||
100 | 1 | |a Weis, Melanie |e Verfasser |4 aut | |
245 | 1 | 0 | |a Duplicate detection in XML data |c Melanie Weis |
264 | 1 | |a Duisburg [u.a.] |b WiKu |c 2008 | |
300 | |a XIV S., S. 15 - 229 |b graph. Darst. |c 21 cm | ||
336 | |b txt |2 rdacontent | ||
337 | |b n |2 rdamedia | ||
338 | |b nc |2 rdacarrier | ||
502 | |a Zugl.: Berlin, Humboldt-Univ., Diss., 2007 | ||
650 | 0 | 7 | |a Erkennung |0 (DE-588)4328500-4 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Datensatz |0 (DE-588)4011133-7 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Ähnlichkeitsmaß |0 (DE-588)4642050-2 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Großes Datenbanksystem |0 (DE-588)4757631-5 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a XML |0 (DE-588)4501553-3 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Dublette |0 (DE-588)4191565-3 |2 gnd |9 rswk-swf |
655 | 7 | |0 (DE-588)4113937-9 |a Hochschulschrift |2 gnd-content | |
689 | 0 | 0 | |a XML |0 (DE-588)4501553-3 |D s |
689 | 0 | 1 | |a Datensatz |0 (DE-588)4011133-7 |D s |
689 | 0 | 2 | |a Dublette |0 (DE-588)4191565-3 |D s |
689 | 0 | 3 | |a Erkennung |0 (DE-588)4328500-4 |D s |
689 | 0 | 4 | |a Ähnlichkeitsmaß |0 (DE-588)4642050-2 |D s |
689 | 0 | 5 | |a Großes Datenbanksystem |0 (DE-588)4757631-5 |D s |
689 | 0 | |5 DE-604 | |
856 | 4 | 2 | |m HBZ Datenaustausch |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=017447132&sequence=000004&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis |
999 | |a oai:aleph.bib-bvb.de:BVB01-017447132 |
Record in the search index
_version_ | 1804139004413607936 |
---|---|
adam_text | Contents
I XML Duplicate Detection
1 Duplicate Detection in a Nutshell
1.1 Introduction
1.2 Problem Formulation
1.2.1 XML and XML Schema by Example
1.2.2 Definitions
1.2.3 Definition Specification
1.2.4 Formalization of Iterative Duplicate Detection
1.2.5 Iterative Duplicate Detection Goals
1.3 Challenges and Contributions
1.3.1 Challenges
1.3.2 Contributions
1.4 Put into Perspective
1.5 Structure
2 DogmatiX
2.1 Duplicate Detection Framework
2.2 DogmatiX Framework Specialization
2.2.1 DogmatiX Candidate Definition
2.2.2 DogmatiX Duplicate Definition
2.2.3 DogmatiX Duplicate Detection
3 Scenarios
3.1 MOVIE Scenario
3.2 CDDB Scenario
3.3 DBLP Scenario
II Domain-Independent XML Duplicate Definition
4 Description Selection
4.1 Schema Based Description Selection Methods
4.1.1 Heuristics
4.1.2 Additional Conditions
4.2 Instance Based Description Selection Methods
4.2.1 Statistics on XML Data
4.2.2 Conditions
4.3 Applying Description Selection Methods
4.3.1 Combining Heuristics and Conditions
4.3.2 Choosing Heuristics and Conditions
5 Duplicate Classifiers
5.1 Requirements in XML
5.2 DogmatiX Similarity Measure
5.2.1 Comparable Descriptions
5.2.2 Similar Descriptions
5.2.3 Missing vs. contradictory descriptions
5.2.4 Data Relevance
5.2.5 XML Similarity Measure
5.3 Efficient Similarity Computation
5.3.1 Efficient Edit Distance Computation
5.3.2 Avoiding XML Similarity Computation
5.3.3 Early Classification
5.4 Other Similarity Measures
5.4.1 DELPHI Containment Metric
5.4.2 RC-ER Similarity Measure
5.4.3 Structure-aware XML Distance
5.4.4 Related Work
6 Duplicate Definition Evaluation
6.1 Experimental Setup
6.2 Effectiveness of XML Similarity Measure for Varying Description Selection
6.3 Comparative Evaluation
III Duplicate Detection Algorithms for Tree and Graph Data
7 Comparison Strategies Using Relationships
7.1 Detection of Duplicate Description Elements
7.2 Top-Down Algorithm
7.2.1 The DELPHI Algorithm for Relational Data
7.2.2 Description Selection and Description Generation
7.2.3 Top-Down Traversal
7.2.4 Duplicate Detection on Single Level
7.3 Bottom-Up Algorithm
7.3.1 Relational Sorted Neighborhood Method
7.3.2 Configuration and Description Generation
7.3.3 Bottom-Up Traversal
7.3.4 Duplicate Detection on Single Level
7.4 DDG: Duplicate Detection in Graph Data
7.4.1 Comparison Order
7.4.2 Comparison Algorithms
7.4.3 Extensions
7.5 Other DDG Algorithms
7.5.1 Duplicate Detection in Personal Information Management
7.5.2 Iterative Relational Clustering for Entity Resolution (RC-ER)
7.5.3 Relationship-Based Data Cleaning (RelDC)
7.5.4 Learning in Collective Duplicate Detection
8 Scalability
8.1 General DDG Approach
8.1.1 Unified Graph Model for DDG
8.1.2 Unified DDG Initialisation
8.1.3 Iterative Phase
8.2 Scaling Up Initialisation
8.2.1 Graph Model in Database
8.2.2 Initializing the Priority Queue
8.2.3 Pre-Computations
8.3 Scaling Up Retrieval Update
8.3.1 Scaling in Space with RECUS/DUP
8.3.2 Scaling in Time with RECUS/BUFF
8.4 Scaling Up Classification
8.4.1 SQL-Based Similarity Computation
8.4.2 Hybrid Similarity Computation
8.4.3 Early Classification
9 Duplicate Detection Evaluation
9.1 Efficiency of Duplicate Description Element Detection
9.2 Comparison Strategy Effectiveness
9.2.1 Top-Down Effectiveness
9.2.2 Bottom-Up Effectiveness
9.2.3 ReconA and AdamA Effectiveness
9.3 Comparison Strategy Efficiency
9.3.1 Candidate Filter Evaluation
9.3.2 Hierarchical Duplicate Detection Efficiency
9.3.3 Graph Duplicate Detection Efficiency
9.4 Scalability
9.4.1 Retrieval and Update Scalability
9.4.2 Classification Scalability
9.4.3 Real-World Behavior
9.4.4 Comparative Evaluation
IV Systems
10 XClean
10.1 XClean Overview
10.1.1 XClean Architecture
10.1.2 Operators
10.2 XClean Programming
10.2.1 Language Rationale and Design
10.2.2 Compiling XClean/PL to XQuery
10.3 Usage report
10.3.1 Use Cases
10.3.2 Quantitative Aspects
11 HumMer
11.1 Description Selection Component
11.2 Duplicate Detection Component
V Conclusion and Outlook
12 Conclusion and Outlook
VI Appendix
A Sample Movie XML Scenario
A.1 Sample XML Data
A.2 Sample XML Schema
B Scenario Schemas
B.1 Movie Scenario
B.2 CDDB Scenario
B.3 DBLP Scenario
C XClean
C.1 XClean/PL Syntax
C.2 Use Case XClean/PL Programs
C.2.1 CDDB Use Case
C.2.2 Movie Use Case
C.2.3 DBLP Use Case
|
any_adam_object | 1 |
author | Weis, Melanie |
author_facet | Weis, Melanie |
author_role | aut |
author_sort | Weis, Melanie |
author_variant | m w mw |
building | Verbundindex |
bvnumber | BV023804956 |
classification_rvk | ST 250 |
ctrlnum | (OCoLC)916005196 (DE-599)BVBBV023804956 |
discipline | Informatik |
format | Thesis Book |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>01833nam a2200457zc 4500</leader><controlfield tag="001">BV023804956</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20090426000000.0</controlfield><controlfield tag="007">t</controlfield><controlfield tag="008">090217s2008 d||| m||| 00||| eng d</controlfield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9783865532633</subfield><subfield code="9">978-3-86553-263-3</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)916005196</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)BVBBV023804956</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-634</subfield><subfield code="a">DE-11</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 250</subfield><subfield code="0">(DE-625)143626:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Weis, Melanie</subfield><subfield code="e">Verfasser</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Duplicate detection in XML data</subfield><subfield code="c">Melanie Weis</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Duisburg [u.a.]</subfield><subfield code="b">WiKu</subfield><subfield code="c">2008</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">XIV S., S. 15 - 229</subfield><subfield code="b">graph. 
Darst.</subfield><subfield code="c">21 cm</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="502" ind1=" " ind2=" "><subfield code="a">Zugl.: Berlin, Humboldt-Univ., Diss., 2007</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Erkennung</subfield><subfield code="0">(DE-588)4328500-4</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Datensatz</subfield><subfield code="0">(DE-588)4011133-7</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Ähnlichkeitsmaß</subfield><subfield code="0">(DE-588)4642050-2</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Großes Datenbanksystem</subfield><subfield code="0">(DE-588)4757631-5</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">XML</subfield><subfield code="0">(DE-588)4501553-3</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Dublette</subfield><subfield code="0">(DE-588)4191565-3</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="655" ind1=" " ind2="7"><subfield code="0">(DE-588)4113937-9</subfield><subfield code="a">Hochschulschrift</subfield><subfield 
code="2">gnd-content</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">XML</subfield><subfield code="0">(DE-588)4501553-3</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="1"><subfield code="a">Datensatz</subfield><subfield code="0">(DE-588)4011133-7</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="2"><subfield code="a">Dublette</subfield><subfield code="0">(DE-588)4191565-3</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="3"><subfield code="a">Erkennung</subfield><subfield code="0">(DE-588)4328500-4</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="4"><subfield code="a">Ähnlichkeitsmaß</subfield><subfield code="0">(DE-588)4642050-2</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="5"><subfield code="a">Großes Datenbanksystem</subfield><subfield code="0">(DE-588)4757631-5</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" "><subfield code="5">DE-604</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">HBZ Datenaustausch</subfield><subfield code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=017447132&sequence=000004&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="999" ind1=" " ind2=" "><subfield code="a">oai:aleph.bib-bvb.de:BVB01-017447132</subfield></datafield></record></collection> |
genre | (DE-588)4113937-9 Hochschulschrift gnd-content |
genre_facet | Hochschulschrift |
id | DE-604.BV023804956 |
illustrated | Illustrated |
indexdate | 2024-07-09T21:37:11Z |
institution | BVB |
isbn | 9783865532633 |
language | English |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-017447132 |
oclc_num | 916005196 |
open_access_boolean | |
owner | DE-634 DE-11 |
owner_facet | DE-634 DE-11 |
physical | XIV S., S. 15 - 229 graph. Darst. 21 cm |
publishDate | 2008 |
publishDateSearch | 2008 |
publishDateSort | 2008 |
publisher | WiKu |
record_format | marc |
spelling | Weis, Melanie Verfasser aut Duplicate detection in XML data Melanie Weis Duisburg [u.a.] WiKu 2008 XIV S., S. 15 - 229 graph. Darst. 21 cm txt rdacontent n rdamedia nc rdacarrier Zugl.: Berlin, Humboldt-Univ., Diss., 2007 Erkennung (DE-588)4328500-4 gnd rswk-swf Datensatz (DE-588)4011133-7 gnd rswk-swf Ähnlichkeitsmaß (DE-588)4642050-2 gnd rswk-swf Großes Datenbanksystem (DE-588)4757631-5 gnd rswk-swf XML (DE-588)4501553-3 gnd rswk-swf Dublette (DE-588)4191565-3 gnd rswk-swf (DE-588)4113937-9 Hochschulschrift gnd-content XML (DE-588)4501553-3 s Datensatz (DE-588)4011133-7 s Dublette (DE-588)4191565-3 s Erkennung (DE-588)4328500-4 s Ähnlichkeitsmaß (DE-588)4642050-2 s Großes Datenbanksystem (DE-588)4757631-5 s DE-604 HBZ Datenaustausch application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=017447132&sequence=000004&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA Inhaltsverzeichnis |
spellingShingle | Weis, Melanie Duplicate detection in XML data Erkennung (DE-588)4328500-4 gnd Datensatz (DE-588)4011133-7 gnd Ähnlichkeitsmaß (DE-588)4642050-2 gnd Großes Datenbanksystem (DE-588)4757631-5 gnd XML (DE-588)4501553-3 gnd Dublette (DE-588)4191565-3 gnd |
subject_GND | (DE-588)4328500-4 (DE-588)4011133-7 (DE-588)4642050-2 (DE-588)4757631-5 (DE-588)4501553-3 (DE-588)4191565-3 (DE-588)4113937-9 |
title | Duplicate detection in XML data |
title_auth | Duplicate detection in XML data |
title_exact_search | Duplicate detection in XML data |
title_full | Duplicate detection in XML data Melanie Weis |
title_fullStr | Duplicate detection in XML data Melanie Weis |
title_full_unstemmed | Duplicate detection in XML data Melanie Weis |
title_short | Duplicate detection in XML data |
title_sort | duplicate detection in xml data |
topic | Erkennung (DE-588)4328500-4 gnd Datensatz (DE-588)4011133-7 gnd Ähnlichkeitsmaß (DE-588)4642050-2 gnd Großes Datenbanksystem (DE-588)4757631-5 gnd XML (DE-588)4501553-3 gnd Dublette (DE-588)4191565-3 gnd |
topic_facet | Erkennung Datensatz Ähnlichkeitsmaß Großes Datenbanksystem XML Dublette Hochschulschrift |
url | http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=017447132&sequence=000004&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
work_keys_str_mv | AT weismelanie duplicatedetectioninxmldata |