Data-mining the web: uncovering patterns in Web content, structure, and usage
Main authors: | Markov, Zdravko, Larose, Daniel T. |
---|---|
Format: | Book |
Language: | English |
Published: | Hoboken, NJ: Wiley, 2007 |
Series: | Wiley series on methods and applications in data mining |
Subjects: | Data Mining |
Online access: | Description for readers; Table of contents only; Table of contents |
Description: | XVI, 218 pp., illustrations, graphs |
ISBN: | 0471666556 9780471666554 |
Internal format
MARC
LEADER | 00000nam a2200000zc 4500 | ||
---|---|---|---|
001 | BV022465187 | ||
003 | DE-604 | ||
005 | 20081209 | ||
007 | t | ||
008 | 070614s2007 xxuad|| |||| 00||| eng d | ||
010 | |a 2006025099 | ||
020 | |a 0471666556 |c cloth |9 0-471-66655-6 | ||
020 | |a 9780471666554 |9 978-0-471-66655-4 | ||
035 | |a (OCoLC)70823224 | ||
035 | |a (DE-599)BVBBV022465187 | ||
040 | |a DE-604 |b ger |e aacr | ||
041 | 0 | |a eng | |
044 | |a xxu |c US | ||
049 | |a DE-20 |a DE-863 |a DE-945 |a DE-473 |a DE-634 |a DE-11 | ||
050 | 0 | |a QA76.9.D343 | |
082 | 0 | |a 005.74 | |
084 | |a ST 530 |0 (DE-625)143679: |2 rvk | ||
100 | 1 | |a Markov, Zdravko |e Verfasser |4 aut | |
245 | 1 | 0 | |a Data-mining the web |b uncovering patterns in Web content, structure, and usage |c Zdravko Markov and Daniel T. Larose |
246 | 1 | 3 | |a Data mining the web |
264 | 1 | |a Hoboken, NJ |b Wiley |c 2007 | |
300 | |a XVI, 218 S. |b Ill., graph. Darst. | ||
336 | |b txt |2 rdacontent | ||
337 | |b n |2 rdamedia | ||
338 | |b nc |2 rdacarrier | ||
490 | 0 | |a Wiley series on methods and applications in data mining | |
650 | 4 | |a Bases de données sur le Web | |
650 | 4 | |a Exploration de données (Informatique) | |
650 | 4 | |a Data mining | |
650 | 4 | |a Web databases | |
650 | 0 | 7 | |a Data Mining |0 (DE-588)4428654-5 |2 gnd |9 rswk-swf |
689 | 0 | 0 | |a Data Mining |0 (DE-588)4428654-5 |D s |
689 | 0 | |5 DE-604 | |
700 | 1 | |a Larose, Daniel T. |e Verfasser |0 (DE-588)1062529189 |4 aut | |
856 | 4 | |u http://www.loc.gov/catdir/enhancements/fy0740/2006025099-b.html |3 Beschreibung für Leser | |
856 | 4 | |u http://www.loc.gov/catdir/toc/ecip0618/2006025099.html |3 Table of contents only | |
856 | 4 | 2 | |m HBZ Datenaustausch |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=015672793&sequence=000004&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis |
999 | |a oai:aleph.bib-bvb.de:BVB01-015672793 |
Record in the search index
DE-BY-863_location | 1000 |
---|---|
DE-BY-FWS_call_number | 1000/ST 530 M346 |
DE-BY-FWS_katkey | 307315 |
DE-BY-FWS_media_number | 083101003880 |
_version_ | 1806176782490009600 |
adam_text | CONTENTS
PREFACE xi
PART I
WEB STRUCTURE MINING
1 INFORMATION RETRIEVAL AND WEB SEARCH 3
Web Challenges 3
Web Search Engines 4
Topic Directories 5
Semantic Web 5
Crawling the Web 6
Web Basics 6
Web Crawlers 7
Indexing and Keyword Search 13
Document Representation 15
Implementation Considerations 19
Relevance Ranking 20
Advanced Text Search 28
Using the HTML Structure in Keyword Search 30
Evaluating Search Quality 32
Similarity Search 36
Cosine Similarity 36
Jaccard Similarity 38
Document Resemblance 41
References 43
Exercises 43
2 HYPERLINK-BASED RANKING 47
Introduction 47
Social Networks Analysis 48
PageRank 50
Authorities and Hubs 53
Link-Based Similarity Search 55
Enhanced Techniques for Page Ranking 56
References 57
Exercises 57
PART II
WEB CONTENT MINING
3 CLUSTERING 61
Introduction 61
Hierarchical Agglomerative Clustering 63
k-Means Clustering 69
Probability-Based Clustering 73
Finite Mixture Problem 74
Classification Problem 76
Clustering Problem 78
Collaborative Filtering (Recommender Systems) 84
References 86
Exercises 86
4 EVALUATING CLUSTERING 89
Approaches to Evaluating Clustering 89
Similarity-Based Criterion Functions 90
Probabilistic Criterion Functions 95
MDL-Based Model and Feature Evaluation 100
Minimum Description Length Principle 101
MDL-Based Model Evaluation 102
Feature Selection 105
Classes-to-Clusters Evaluation 106
Precision, Recall, and F-Measure 108
Entropy 111
References 112
Exercises 112
5 CLASSIFICATION 115
General Setting and Evaluation Techniques 115
Nearest-Neighbor Algorithm 118
Feature Selection 121
Naive Bayes Algorithm 125
Numerical Approaches 131
Relational Learning 133
References 137
Exercises 138
PART III
WEB USAGE MINING
6 INTRODUCTION TO WEB USAGE MINING 143
Definition of Web Usage Mining 143
Cross-Industry Standard Process for Data Mining 144
Clickstream Analysis 147
Web Server Log Files 148
Remote Host Field 149
Date/Time Field 149
HTTP Request Field 149
Status Code Field 150
Transfer Volume (Bytes) Field 151
Common Log Format 151
Identification Field 151
Authuser Field 151
Extended Common Log Format 151
Referrer Field 152
User Agent Field 152
Example of a Web Log Record 152
Microsoft IIS Log Format 153
Auxiliary Information 154
References 154
Exercises 154
7 PREPROCESSING FOR WEB USAGE MINING 156
Need for Preprocessing the Data 156
Data Cleaning and Filtering 158
Page Extension Exploration and Filtering 161
De-Spidering the Web Log File 163
User Identification 164
Session Identification 167
Path Completion 170
Directories and the Basket Transformation 171
Further Data Preprocessing Steps 174
References 174
Exercises 174
8 EXPLORATORY DATA ANALYSIS FOR WEB USAGE MINING 177
Introduction 177
Number of Visit Actions 177
Session Duration 178
Relationship between Visit Actions and Session Duration 181
Average Time per Page 183
Duration for Individual Pages 185
References 188
Exercises 188
9 MODELING FOR WEB USAGE MINING: CLUSTERING, ASSOCIATION, AND CLASSIFICATION 191
Introduction 191
Modeling Methodology 192
Definition of Clustering 193
The BIRCH Clustering Algorithm 194
Affinity Analysis and the A Priori Algorithm 197
Discretizing the Numerical Variables: Binning 199
Applying the A Priori Algorithm to the CCSU Web Log Data 201
Classification and Regression Trees 204
The C4.5 Algorithm 208
References 210
Exercises 211
INDEX 213
|
any_adam_object | 1 |
any_adam_object_boolean | 1 |
author | Markov, Zdravko Larose, Daniel T. |
author_GND | (DE-588)1062529189 |
author_facet | Markov, Zdravko Larose, Daniel T. |
author_role | aut aut |
author_sort | Markov, Zdravko |
author_variant | z m zm d t l dt dtl |
building | Verbundindex |
bvnumber | BV022465187 |
callnumber-first | Q - Science |
callnumber-label | QA76 |
callnumber-raw | QA76.9.D343 |
callnumber-search | QA76.9.D343 |
callnumber-sort | QA 276.9 D343 |
callnumber-subject | QA - Mathematics |
classification_rvk | ST 530 |
ctrlnum | (OCoLC)70823224 (DE-599)BVBBV022465187 |
dewey-full | 005.74 |
dewey-hundreds | 000 - Computer science, information, general works |
dewey-ones | 005 - Computer programming, programs, data, security |
dewey-raw | 005.74 |
dewey-search | 005.74 |
dewey-sort | 15.74 |
dewey-tens | 000 - Computer science, information, general works |
discipline | Informatik |
discipline_str_mv | Informatik |
format | Book |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>01925nam a2200481zc 4500</leader><controlfield tag="001">BV022465187</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20081209 </controlfield><controlfield tag="007">t</controlfield><controlfield tag="008">070614s2007 xxuad|| |||| 00||| eng d</controlfield><datafield tag="010" ind1=" " ind2=" "><subfield code="a">2006025099</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">0471666556</subfield><subfield code="c">cloth</subfield><subfield code="9">0-471-66655-6</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9780471666554</subfield><subfield code="9">978-0-471-66655-4</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)70823224</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)BVBBV022465187</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield><subfield code="e">aacr</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="044" ind1=" " ind2=" "><subfield code="a">xxu</subfield><subfield code="c">US</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-20</subfield><subfield code="a">DE-863</subfield><subfield code="a">DE-945</subfield><subfield code="a">DE-473</subfield><subfield code="a">DE-634</subfield><subfield code="a">DE-11</subfield></datafield><datafield tag="050" ind1=" " ind2="0"><subfield code="a">QA76.9.D343</subfield></datafield><datafield tag="082" ind1="0" ind2=" "><subfield code="a">005.74</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 530</subfield><subfield code="0">(DE-625)143679:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="100" 
ind1="1" ind2=" "><subfield code="a">Markov, Zdravko</subfield><subfield code="e">Verfasser</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Data-mining the web</subfield><subfield code="b">uncovering patterns in Web content, structure, and usage</subfield><subfield code="c">Zdravko Markov and Daniel T. Larose</subfield></datafield><datafield tag="246" ind1="1" ind2="3"><subfield code="a">Data mining the web</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Hoboken, NJ</subfield><subfield code="b">Wiley</subfield><subfield code="c">2007</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">XVI, 218 S.</subfield><subfield code="b">Ill., graph. Darst.</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="490" ind1="0" ind2=" "><subfield code="a">Wiley series on methods and applications in data mining</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Bases de données sur le Web</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Exploration de données (Informatique)</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Data mining</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Web databases</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Data Mining</subfield><subfield code="0">(DE-588)4428654-5</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">Data 
Mining</subfield><subfield code="0">(DE-588)4428654-5</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" "><subfield code="5">DE-604</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Larose, Daniel T.</subfield><subfield code="e">Verfasser</subfield><subfield code="0">(DE-588)1062529189</subfield><subfield code="4">aut</subfield></datafield><datafield tag="856" ind1="4" ind2=" "><subfield code="u">http://www.loc.gov/catdir/enhancements/fy0740/2006025099-b.html</subfield><subfield code="3">Beschreibung für Leser</subfield></datafield><datafield tag="856" ind1="4" ind2=" "><subfield code="u">http://www.loc.gov/catdir/toc/ecip0618/2006025099.html</subfield><subfield code="3">Table of contents only</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">HBZ Datenaustausch</subfield><subfield code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=015672793&sequence=000004&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="999" ind1=" " ind2=" "><subfield code="a">oai:aleph.bib-bvb.de:BVB01-015672793</subfield></datafield></record></collection> |
id | DE-604.BV022465187 |
illustrated | Illustrated |
index_date | 2024-07-02T17:42:04Z |
indexdate | 2024-08-01T11:26:47Z |
institution | BVB |
isbn | 0471666556 9780471666554 |
language | English |
lccn | 2006025099 |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-015672793 |
oclc_num | 70823224 |
open_access_boolean | |
owner | DE-20 DE-863 DE-BY-FWS DE-945 DE-473 DE-BY-UBG DE-634 DE-11 |
owner_facet | DE-20 DE-863 DE-BY-FWS DE-945 DE-473 DE-BY-UBG DE-634 DE-11 |
physical | XVI, 218 S. Ill., graph. Darst. |
publishDate | 2007 |
publishDateSearch | 2007 |
publishDateSort | 2007 |
publisher | Wiley |
record_format | marc |
series2 | Wiley series on methods and applications in data mining |
spellingShingle | Markov, Zdravko Larose, Daniel T. Data-mining the web uncovering patterns in Web content, structure, and usage Bases de données sur le Web Exploration de données (Informatique) Data mining Web databases Data Mining (DE-588)4428654-5 gnd |
subject_GND | (DE-588)4428654-5 |
title | Data-mining the web uncovering patterns in Web content, structure, and usage |
title_alt | Data mining the web |
title_auth | Data-mining the web uncovering patterns in Web content, structure, and usage |
title_exact_search | Data-mining the web uncovering patterns in Web content, structure, and usage |
title_exact_search_txtP | Data-mining the web uncovering patterns in Web content, structure, and usage |
title_full | Data-mining the web uncovering patterns in Web content, structure, and usage Zdravko Markov and Daniel T. Larose |
title_fullStr | Data-mining the web uncovering patterns in Web content, structure, and usage Zdravko Markov and Daniel T. Larose |
title_full_unstemmed | Data-mining the web uncovering patterns in Web content, structure, and usage Zdravko Markov and Daniel T. Larose |
title_short | Data-mining the web |
title_sort | data mining the web uncovering patterns in web content structure and usage |
title_sub | uncovering patterns in Web content, structure, and usage |
topic | Bases de données sur le Web Exploration de données (Informatique) Data mining Web databases Data Mining (DE-588)4428654-5 gnd |
topic_facet | Bases de données sur le Web Exploration de données (Informatique) Data mining Web databases Data Mining |
url | http://www.loc.gov/catdir/enhancements/fy0740/2006025099-b.html http://www.loc.gov/catdir/toc/ecip0618/2006025099.html http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=015672793&sequence=000004&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
work_keys_str_mv | AT markovzdravko dataminingthewebuncoveringpatternsinwebcontentstructureandusage AT larosedanielt dataminingthewebuncoveringpatternsinwebcontentstructureandusage AT markovzdravko dataminingtheweb AT larosedanielt dataminingtheweb |
Table of contents
THWS Würzburg Zentralbibliothek, Reading Room
Call number: |
1000 ST 530 M346 |
---|---|
Copy 1 | Loanable, available |