Mining the Web: discovering knowledge from hypertext data
Main Author: | Chakrabarti, Soumen |
---|---|
Format: | Book |
Language: | English |
Published: | Amsterdam [etc.] Morgan Kaufmann Publ. 2007 |
Edition: | [Reprint] |
Series: | The Morgan Kaufmann series in data management systems |
Subjects: | Data mining; Hypertext systems; Web databases; Automatic data collection; World Wide Web |
Online Access: | Publisher description; Table of contents; Table of contents (PDF, digitized by UB Bamberg) |
Item Description: | Includes bibliographical references (p. 307-326) and index |
Physical Description: | XVIII, 345 pp., illustrations, diagrams |
ISBN: | 1558607544 9781558607545 |
Internal format
MARC
LEADER | 00000nam a2200000zc 4500 | ||
---|---|---|---|
001 | BV036132174 | ||
003 | DE-604 | ||
005 | 20110616 | ||
007 | t | ||
008 | 100422s2007 xxuad|| |||| 00||| eng d | ||
016 | 7 | |a ocn263706453 |2 DE-101 | |
020 | |a 1558607544 |9 1-55860-754-4 | ||
020 | |a 9781558607545 |9 978-1-55860-754-5 | ||
035 | |a (OCoLC)263706453 | ||
035 | |a (DE-599)BVBBV036132174 | ||
040 | |a DE-604 |b ger |e aacr | ||
041 | 0 | |a eng | |
044 | |a xxu |c US | ||
049 | |a DE-473 |a DE-945 | ||
050 | 0 | |a QA76.9.D343 C45 2007 | |
082 | 0 | |a 005.72 | |
082 | 0 | |a 005.78/8 22 | |
084 | |a ST 270 |0 (DE-625)143638: |2 rvk | ||
084 | |a ST 530 |0 (DE-625)143679: |2 rvk | ||
100 | 1 | |a Chakrabarti, Soumen |e Verfasser |4 aut | |
245 | 1 | 0 | |a Mining the Web |b discovering knowledge from hypertext data |c Soumen Chakrabarti |
250 | |a [Nachdr.] | ||
264 | 1 | |a Amsterdam [u.a.] |b Morgan Kaufmann Publ. |c 2007 | |
300 | |a XVIII, 345 S. |b Ill., graph. Darst. | ||
336 | |b txt |2 rdacontent | ||
337 | |b n |2 rdamedia | ||
338 | |b nc |2 rdacarrier | ||
490 | 0 | |a The Morgan Kaufmann series in data management systems | |
500 | |a Includes bibliographical references (p. 307-326) and index | ||
650 | 4 | |a Data mining | |
650 | 4 | |a Hypertext systems | |
650 | 4 | |a Web databases | |
650 | 4 | |a Automatic data collection | |
650 | 4 | |a Data Mining - World Wide Web | |
650 | 4 | |a Automatic data collection | |
650 | 4 | |a Data mining | |
650 | 4 | |a Hypertext systems | |
650 | 4 | |a Web databases | |
650 | 0 | 7 | |a World Wide Web |0 (DE-588)4363898-3 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Data Mining |0 (DE-588)4428654-5 |2 gnd |9 rswk-swf |
689 | 0 | 0 | |a World Wide Web |0 (DE-588)4363898-3 |D s |
689 | 0 | 1 | |a Data Mining |0 (DE-588)4428654-5 |D s |
689 | 0 | |C b |5 DE-604 | |
856 | 4 | 2 | |q text/html |u http://www.loc.gov/catdir/description/els031/2002107241.html |3 Publisher description |
856 | 4 | 2 | |q text/html |u http://www.loc.gov/catdir/toc/els031/2002107241.html |3 Table of contents |
856 | 4 | 2 | |m Digitalisierung UB Bamberg |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=020214775&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis |
999 | |a oai:aleph.bib-bvb.de:BVB01-020214775 |
Record in the search index
_version_ | 1804142806267068416 |
---|---|
adam_text | CONTENTS
Foreword, Jiawei Han vii
Preface xv
1 INTRODUCTION
1.1 Crawling and Indexing 6
1.2 Topic Directories 7
1.3 Clustering and Classification 8
1.4 Hyperlink Analysis 9
1.5 Resource Discovery and Vertical Portals 11
1.6 Structured vs. Unstructured Data Mining 11
1.7 Bibliographic Notes 13
PART I INFRASTRUCTURE
2 CRAWLING THE WEB
2.1 HTML and HTTP Basics 18
2.2 Crawling Basics 19
2.3 Engineering Large-Scale Crawlers 21
2.3.1 DNS Caching, Prefetching, and Resolution 22
2.3.2 Multiple Concurrent Fetches 23
2.3.3 Link Extraction and Normalization 25
2.3.4 Robot Exclusion 26
2.3.5 Eliminating Already-Visited URLs 26
2.3.6 Spider Traps 28
2.3.7 Avoiding Repeated Expansion of Links on Duplicate Pages 29
2.3.8 Load Monitor and Manager 29
2.3.9 Per-Server Work-Queues 30
2.3.10 Text Repository 31
2.3.11 Refreshing Crawled Pages 33
2.4 Putting Together a Crawler 35
2.4.1 Design of the Core Components 35
2.4.2 Case Study: Using w3c-libwww 40
2.5 Bibliographic Notes 40
3 WEB SEARCH AND INFORMATION RETRIEVAL
3.1 Boolean Queries and the Inverted Index 45
3.1.1 Stopwords and Stemming 48
3.1.2 Batch Indexing and Updates 49
3.1.3 Index Compression Techniques 51
3.2 Relevance Ranking 53
3.2.1 Recall and Precision 53
3.2.2 The Vector-Space Model 56
3.2.3 Relevance Feedback and Rocchio's Method 57
3.2.4 Probabilistic Relevance Feedback Models 58
3.2.5 Advanced Issues 61
3.3 Similarity Search 67
3.3.1 Handling Find-Similar Queries 68
3.3.2 Eliminating Near Duplicates via Shingling 71
3.3.3 Detecting Locally Similar Subgraphs of the Web 73
3.4 Bibliographic Notes 75
PART II LEARNING
4 SIMILARITY AND CLUSTERING
4.1 Formulations and Approaches 81
4.1.1 Partitioning Approaches 81
4.1.2 Geometric Embedding Approaches 82
4.1.3 Generative Models and Probabilistic Approaches 83
4.2 Bottom-Up and Top-Down Partitioning Paradigms 84
4.2.1 Agglomerative Clustering 84
4.2.2 The k-Means Algorithm 87
4.3 Clustering and Visualization via Embeddings 89
4.3.1 Self-Organizing Maps (SOMs) 90
4.3.2 Multidimensional Scaling (MDS) and FastMap 91
4.3.3 Projections and Subspaces 94
4.3.4 Latent Semantic Indexing (LSI) 96
4.4 Probabilistic Approaches to Clustering 99
4.4.1 Generative Distributions for Documents 101
4.4.2 Mixture Models and Expectation Maximization (EM) 103
4.4.3 Multiple Cause Mixture Model (MCMM) 108
4.4.4 Aspect Models and Probabilistic LSI 109
4.4.5 Model and Feature Selection 112
4.5 Collaborative Filtering 115
4.5.1 Probabilistic Models 115
4.5.2 Combining Content-Based and Collaborative Features 117
4.6 Bibliographic Notes 121
5 SUPERVISED LEARNING
5.1 The Supervised Learning Scenario 126
5.2 Overview of Classification Strategies 128
5.3 Evaluating Text Classifiers 129
5.3.1 Benchmarks 130
5.3.2 Measures of Accuracy 131
5.4 Nearest Neighbor Learners 133
5.4.1 Pros and Cons 134
5.4.2 Is TFIDF Appropriate? 135
5.5 Feature Selection 136
5.5.1 Greedy Inclusion Algorithms 137
5.5.2 Truncation Algorithms 144
5.5.3 Comparison and Discussion 145
5.6 Bayesian Learners 147
5.6.1 Naive Bayes Learners 148
5.6.2 Small-Degree Bayesian Networks 152
5.7 Exploiting Hierarchy among Topics 155
5.7.1 Feature Selection 155
5.7.2 Enhanced Parameter Estimation 155
5.7.3 Training and Search Strategies 157
5.8 Maximum Entropy Learners 160
5.9 Discriminative Classification 163
5.9.1 Linear Least-Square Regression 163
5.9.2 Support Vector Machines 164
5.10 Hypertext Classification 169
5.10.1 Representing Hypertext for Supervised Learning 169
5.10.2 Rule Induction 171
5.11 Bibliographic Notes 173
6 SEMISUPERVISED LEARNING
6.1 Expectation Maximization 178
6.1.1 Experimental Results 179
6.1.2 Reducing the Belief in Unlabeled Documents 181
6.1.3 Modeling Labels Using Many Mixture Components 183
6.2 Labeling Hypertext Graphs 184
6.2.1 Absorbing Features from Neighboring Pages 185
6.2.2 A Relaxation Labeling Algorithm 188
6.2.3 A Metric Graph-Labeling Problem 193
6.3 Co-training 195
6.4 Bibliographic Notes 198
PART III APPLICATIONS
7 SOCIAL NETWORK ANALYSIS
7.1 Social Sciences and Bibliometry 205
7.1.1 Prestige 205
7.1.2 Centrality 206
7.1.3 Co-citation 207
7.2 PageRank and HITS 209
7.2.1 PageRank 209
7.2.2 HITS 212
7.2.3 Stochastic HITS and Other Variants 216
7.3 Shortcomings of the Coarse-Grained Graph Model 219
7.3.1 Artifacts of Web Authorship 219
7.3.2 Topic Contamination and Drift 223
7.4 Enhanced Models and Techniques 225
7.4.1 Avoiding Two-Party Nepotism 225
7.4.2 Outlier Elimination 226
7.4.3 Exploiting Anchor Text 227
7.4.4 Exploiting Document Markup Structure 228
7.5 Evaluation of Topic Distillation 235
7.5.1 HITS and Related Algorithms 235
7.5.2 Effect of Exploiting Other Hypertext Features 238
7.6 Measuring and Modeling the Web 243
7.6.1 Power-Law Degree Distributions 243
7.6.2 The Bow Tie Structure and Bipartite Cores 246
7.6.3 Sampling Web Pages at Random 246
7.7 Bibliographic Notes 254
8 RESOURCE DISCOVERY
8.1 Collecting Important Pages Preferentially 257
8.1.1 Crawling as Guided Search in a Graph 257
8.1.2 Keyword-Based Graph Search 259
8.2 Similarity Search Using Link Topology 264
8.3 Topical Locality and Focused Crawling 268
8.3.1 Focused Crawling 270
8.3.2 Identifying and Exploiting Hubs 277
8.3.3 Learning Context Graphs 279
8.3.4 Reinforcement Learning 280
8.4 Discovering Communities 284
8.4.1 Bipartite Cores as Communities 284
8.4.2 Network Flow/Cut-Based Notions of Communities 285
8.5 Bibliographic Notes 288
9 THE FUTURE OF WEB MINING
9.1 Information Extraction 290
9.2 Natural Language Processing 295
9.2.1 Lexical Networks and Ontologies 296
9.2.2 Part-of-Speech and Sense Tagging 297
9.2.3 Parsing and Knowledge Representation 299
9.3 Question Answering 302
9.4 Profiles, Personalization, and Collaboration 305
References 307
Index 327
About the Author 345
|
any_adam_object | 1 |
author | Chakrabarti, Soumen |
author_facet | Chakrabarti, Soumen |
author_role | aut |
author_sort | Chakrabarti, Soumen |
author_variant | s c sc |
building | Verbundindex |
bvnumber | BV036132174 |
callnumber-first | Q - Science |
callnumber-label | QA76 |
callnumber-raw | QA76.9.D343 C45 2007 |
callnumber-search | QA76.9.D343 C45 2007 |
callnumber-sort | QA 276.9 D343 C45 42007 |
callnumber-subject | QA - Mathematics |
classification_rvk | ST 270 ST 530 |
ctrlnum | (OCoLC)263706453 (DE-599)BVBBV036132174 |
dewey-full | 005.72 005.78/822 |
dewey-hundreds | 000 - Computer science, information, general works |
dewey-ones | 005 - Computer programming, programs, data, security |
dewey-raw | 005.72 005.78/8 22 |
dewey-search | 005.72 005.78/8 22 |
dewey-sort | 15.72 |
dewey-tens | 000 - Computer science, information, general works |
discipline | Informatik |
edition | [Nachdr.] |
format | Book |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>02262nam a2200589zc 4500</leader><controlfield tag="001">BV036132174</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20110616 </controlfield><controlfield tag="007">t</controlfield><controlfield tag="008">100422s2007 xxuad|| |||| 00||| eng d</controlfield><datafield tag="016" ind1="7" ind2=" "><subfield code="a">ocn263706453</subfield><subfield code="2">DE-101</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">1558607544</subfield><subfield code="9">1-55860-754-4</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9781558607545</subfield><subfield code="9">978-1-55860-754-5</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)263706453</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)BVBBV036132174</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield><subfield code="e">aacr</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="044" ind1=" " ind2=" "><subfield code="a">xxu</subfield><subfield code="c">US</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-473</subfield><subfield code="a">DE-945</subfield></datafield><datafield tag="050" ind1=" " ind2="0"><subfield code="a">QA76.9.D343 C45 2007</subfield></datafield><datafield tag="082" ind1="0" ind2=" "><subfield code="a">005.72</subfield></datafield><datafield tag="082" ind1="0" ind2=" "><subfield code="a">005.78/8 22</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 270</subfield><subfield code="0">(DE-625)143638:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 
530</subfield><subfield code="0">(DE-625)143679:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Chakrabarti, Soumen</subfield><subfield code="e">Verfasser</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Mining the Web</subfield><subfield code="b">discovering knowledge from hypertext data</subfield><subfield code="c">Soumen Chakrabarti</subfield></datafield><datafield tag="250" ind1=" " ind2=" "><subfield code="a">[Nachdr.]</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Amsterdam [u.a.]</subfield><subfield code="b">Morgan Kaufmann Publ.</subfield><subfield code="c">2007</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">XVIII, 345 S.</subfield><subfield code="b">Ill., graph. Darst.</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="490" ind1="0" ind2=" "><subfield code="a">The Morgan Kaufmann series in data management systems</subfield></datafield><datafield tag="500" ind1=" " ind2=" "><subfield code="a">Includes bibliographical references (p. 
307-326) and index</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Data mining</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Hypertext systems</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Web databases</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Automatic data collection</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Data Mining - World Wide Web</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Automatic data collection</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Data mining</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Hypertext systems</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Web databases</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">World Wide Web</subfield><subfield code="0">(DE-588)4363898-3</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Data Mining</subfield><subfield code="0">(DE-588)4428654-5</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">World Wide Web</subfield><subfield code="0">(DE-588)4363898-3</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="1"><subfield code="a">Data Mining</subfield><subfield code="0">(DE-588)4428654-5</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" "><subfield code="C">b</subfield><subfield code="5">DE-604</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="q">text/html</subfield><subfield code="u">http://www.loc.gov/catdir/description/els031/2002107241.html</subfield><subfield 
code="3">Publisher description</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="q">text/html</subfield><subfield code="u">http://www.loc.gov/catdir/toc/els031/2002107241.html</subfield><subfield code="3">Table of contents</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">Digitalisierung UB Bamberg</subfield><subfield code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=020214775&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="999" ind1=" " ind2=" "><subfield code="a">oai:aleph.bib-bvb.de:BVB01-020214775</subfield></datafield></record></collection> |
id | DE-604.BV036132174 |
illustrated | Illustrated |
indexdate | 2024-07-09T22:37:37Z |
institution | BVB |
isbn | 1558607544 9781558607545 |
language | English |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-020214775 |
oclc_num | 263706453 |
open_access_boolean | |
owner | DE-473 DE-BY-UBG DE-945 |
owner_facet | DE-473 DE-BY-UBG DE-945 |
physical | XVIII, 345 S. Ill., graph. Darst. |
publishDate | 2007 |
publishDateSearch | 2007 |
publishDateSort | 2007 |
publisher | Morgan Kaufmann Publ. |
record_format | marc |
series2 | The Morgan Kaufmann series in data management systems |
spelling | Chakrabarti, Soumen Verfasser aut Mining the Web discovering knowledge from hypertext data Soumen Chakrabarti [Nachdr.] Amsterdam [u.a.] Morgan Kaufmann Publ. 2007 XVIII, 345 S. Ill., graph. Darst. txt rdacontent n rdamedia nc rdacarrier The Morgan Kaufmann series in data management systems Includes bibliographical references (p. 307-326) and index Data mining Hypertext systems Web databases Automatic data collection Data Mining - World Wide Web World Wide Web (DE-588)4363898-3 gnd rswk-swf Data Mining (DE-588)4428654-5 gnd rswk-swf World Wide Web (DE-588)4363898-3 s Data Mining (DE-588)4428654-5 s b DE-604 text/html http://www.loc.gov/catdir/description/els031/2002107241.html Publisher description text/html http://www.loc.gov/catdir/toc/els031/2002107241.html Table of contents Digitalisierung UB Bamberg application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=020214775&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA Inhaltsverzeichnis |
spellingShingle | Chakrabarti, Soumen Mining the Web discovering knowledge from hypertext data Data mining Hypertext systems Web databases Automatic data collection Data Mining - World Wide Web World Wide Web (DE-588)4363898-3 gnd Data Mining (DE-588)4428654-5 gnd |
subject_GND | (DE-588)4363898-3 (DE-588)4428654-5 |
title | Mining the Web discovering knowledge from hypertext data |
title_auth | Mining the Web discovering knowledge from hypertext data |
title_exact_search | Mining the Web discovering knowledge from hypertext data |
title_full | Mining the Web discovering knowledge from hypertext data Soumen Chakrabarti |
title_fullStr | Mining the Web discovering knowledge from hypertext data Soumen Chakrabarti |
title_full_unstemmed | Mining the Web discovering knowledge from hypertext data Soumen Chakrabarti |
title_short | Mining the Web |
title_sort | mining the web discovering knowledge from hypertext data |
title_sub | discovering knowledge from hypertext data |
topic | Data mining Hypertext systems Web databases Automatic data collection Data Mining - World Wide Web World Wide Web (DE-588)4363898-3 gnd Data Mining (DE-588)4428654-5 gnd |
topic_facet | Data mining Hypertext systems Web databases Automatic data collection Data Mining - World Wide Web World Wide Web Data Mining |
url | http://www.loc.gov/catdir/description/els031/2002107241.html http://www.loc.gov/catdir/toc/els031/2002107241.html http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=020214775&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
work_keys_str_mv | AT chakrabartisoumen miningthewebdiscoveringknowledgefromhypertextdata |