Web data mining: exploring hyperlinks, contents, and usage data

Bibliographic details
Main author:	Liu, Bing (author)
Format:	Book
Language:	English
Published:	Berlin [and others]: Springer, 2007
Series:	Data-centric systems and applications
Subjects:	Data Mining; World Wide Web
Online access:	Content description; Table of contents
Description:	Bibliography: pp. [486]-515
Description:	XIX, 532 pp., illustrations, diagrams
ISBN:	9783540378815; 3540378812

Internal format

MARC


LEADER	00000nam a2200000 c 4500
001	BV021759541
003	DE-604
005	20081023
007	t
008	061009s2007 gw ad\|\| \|\|\|\| 00\|\|\| eng d
015			\|a 06,N36,0032 \|2 dnb
016	7		\|a 980843286 \|2 DE-101
020			\|a 9783540378815 \|c Gb. : ca. EUR 58.80 (freier Pr.), ca. sfr 97.50 (freier Pr.) \|9 978-3-540-37881-5
020			\|a 3540378812 \|c Gb. : ca. EUR 58.80 (freier Pr.), ca. sfr 97.50 (freier Pr.) \|9 3-540-37881-2
024	3		\|a 9783540378815
028	5	2	\|a 11415190
035			\|a (OCoLC)180943345
035			\|a (DE-599)BVBBV021759541
040			\|a DE-604 \|b ger \|e rakddb
041	0		\|a eng
044			\|a gw \|c XA-DE-BE
049			\|a DE-384 \|a DE-824 \|a DE-703 \|a DE-355 \|a DE-19 \|a DE-634 \|a DE-83
082	0		\|a 005.72 \|2 22/ger
082	0		\|a 006.312 \|2 22/ger
084			\|a ST 205 \|0 (DE-625)143613: \|2 rvk
084			\|a ST 530 \|0 (DE-625)143679: \|2 rvk
084			\|a 004 \|2 sdnb
100	1		\|a Liu, Bing \|e Verfasser \|4 aut
245	1	0	\|a Web data mining \|b exploring hyperlinks, contents, and usage data \|c Bing Liu
264		1	\|a Berlin [u.a.] \|b Springer \|c 2007
300			\|a XIX, 532 S. \|b Ill., graph. Darst.
336			\|b txt \|2 rdacontent
337			\|b n \|2 rdamedia
338			\|b nc \|2 rdacarrier
490	0		\|a Data-centric systems and applications
500			\|a Literaturverz. S. [486] - 515
650	0	7	\|a Data Mining \|0 (DE-588)4428654-5 \|2 gnd \|9 rswk-swf
650	0	7	\|a World Wide Web \|0 (DE-588)4363898-3 \|2 gnd \|9 rswk-swf
689	0	0	\|a World Wide Web \|0 (DE-588)4363898-3 \|D s
689	0	1	\|a Data Mining \|0 (DE-588)4428654-5 \|D s
689	0		\|5 DE-604
856	4	2	\|q text/html \|u http://deposit.dnb.de/cgi-bin/dokserv?id=2844145&prov=M&dok_var=1&dok_ext=htm \|3 Inhaltstext
856	4	2	\|m HBZ Datenaustausch \|q application/pdf \|u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=014972637&sequence=000003&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA \|3 Inhaltsverzeichnis
943	1		\|a oai:aleph.bib-bvb.de:BVB01-014972637

Record in the search index

_version_	1805088442941440000
adam_text	Table of Contents
1. Introduction 1
1.1. What is the World Wide Web? 1
1.2. A Brief History of the Web and the Internet 2
1.3. Web Data Mining 4
1.3.1. What is Data Mining? 6
1.3.2. What is Web Mining? 6
1.4. Summary of Chapters 8
1.5. How to Read this Book 11
Bibliographic Notes 12
Part I: Data Mining Foundations
2. Association Rules and Sequential Patterns 13
2.1. Basic Concepts of Association Rules 13
2.2. Apriori Algorithm 16
2.2.1. Frequent Itemset Generation 16
2.2.2. Association Rule Generation 20
2.3. Data Formats for Association Rule Mining 22
2.4. Mining with Multiple Minimum Supports 22
2.4.1. Extended Model 24
2.4.2. Mining Algorithm 26
2.4.3. Rule Generation 31
2.5. Mining Class Association Rules 32
2.5.1. Problem Definition 32
2.5.2. Mining Algorithm 34
2.5.3. Mining with Multiple Minimum Supports 37
2.6. Basic Concepts of Sequential Patterns 37
2.7. Mining Sequential Patterns Based on GSP 39
2.7.1. GSP Algorithm 39
2.7.2. Mining with Multiple Minimum Supports 41
2.8. Mining Sequential Patterns Based on PrefixSpan 45
2.8.1. PrefixSpan Algorithm 46
2.8.2. Mining with Multiple Minimum Supports 48
2.9. Generating Rules from Sequential Patterns 49
2.9.1. Sequential Rules 50
2.9.2. Label Sequential Rules 50
2.9.3. Class Sequential Rules 51
Bibliographic Notes 52
3. Supervised Learning 55
3.1. Basic Concepts 55
3.2. Decision Tree Induction 59
3.2.1. Learning Algorithm 62
3.2.2. Impurity Function 63
3.2.3. Handling of Continuous Attributes 67
3.2.4. Some Other Issues 68
3.3. Classifier Evaluation 71
3.3.1. Evaluation Methods 71
3.3.2. Precision, Recall, F-score and Breakeven Point 73
3.4. Rule Induction 75
3.4.1. Sequential Covering 75
3.4.2. Rule Learning: Learn-One-Rule Function 78
3.4.3. Discussion 81
3.5. Classification Based on Associations 81
3.5.1. Classification Using Class Association Rules 82
3.5.2. Class Association Rules as Features 86
3.5.3. Classification Using Normal Association Rules 86
3.6. Naive Bayesian Classification 87
3.7. Naive Bayesian Text Classification 91
3.7.1. Probabilistic Framework 92
3.7.2. Naive Bayesian Model 93
3.7.3. Discussion 96
3.8. Support Vector Machines 97
3.8.1. Linear SVM: Separable Case 99
3.8.2. Linear SVM: Non-Separable Case 105
3.8.3. Nonlinear SVM: Kernel Functions 108
3.9. K-Nearest Neighbor Learning 112
3.10. Ensemble of Classifiers 113
3.10.1. Bagging 114
3.10.2. Boosting 114
Bibliographic Notes 115
4. Unsupervised Learning 117
4.1. Basic Concepts 117
4.2. K-means Clustering 120
4.2.1. K-means Algorithm 120
4.2.2. Disk Version of the K-means Algorithm 123
4.2.3. Strengths and Weaknesses 124
4.3. Representation of Clusters 128
4.3.1. Common Ways of Representing Clusters 129
4.3.2. Clusters of Arbitrary Shapes 130
4.4. Hierarchical Clustering 131
4.4.1. Single Link Method 133
4.4.2. Complete Link Method 133
4.4.3. Average Link Method 134
4.4.4. Strengths and Weaknesses 134
4.5. Distance Functions 135
4.5.1. Numeric Attributes 135
4.5.2. Binary and Nominal Attributes 136
4.5.3. Text Documents 138
4.6. Data Standardization 139
4.7. Handling of Mixed Attributes 141
4.8. Which Clustering Algorithm to Use? 143
4.9. Cluster Evaluation 143
4.10. Discovering Holes and Data Regions 146
Bibliographic Notes 149
5. Partially Supervised Learning 151
5.1. Learning from Labeled and Unlabeled Examples 151
5.1.1. EM Algorithm with Naive Bayesian Classification 153
5.1.2. Co-Training 156
5.1.3. Self-Training 158
5.1.4. Transductive Support Vector Machines 159
5.1.5. Graph-Based Methods 160
5.1.6. Discussion 164
5.2. Learning from Positive and Unlabeled Examples 165
5.2.1. Applications of PU Learning 165
5.2.2. Theoretical Foundation 168
5.2.3. Building Classifiers: Two-Step Approach 169
5.2.4. Building Classifiers: Direct Approach 175
5.2.5. Discussion 178
Appendix: Derivation of EM for Naive Bayesian Classification 179
Bibliographic Notes 181
Part II: Web Mining
6. Information Retrieval and Web Search 183
6.1. Basic Concepts of Information Retrieval 184
6.2. Information Retrieval Models 187
6.2.1. Boolean Model 188
6.2.2. Vector Space Model 188
6.2.3. Statistical Language Model 191
6.3. Relevance Feedback 192
6.4. Evaluation Measures 195
6.5. Text and Web Page Pre-Processing 199
6.5.1. Stopword Removal 199
6.5.2. Stemming 200
6.5.3. Other Pre-Processing Tasks for Text 200
6.5.4. Web Page Pre-Processing 201
6.5.5. Duplicate Detection 203
6.6. Inverted Index and Its Compression 204
6.6.1. Inverted Index 204
6.6.2. Search Using an Inverted Index 206
6.6.3. Index Construction 207
6.6.4. Index Compression 209
6.7. Latent Semantic Indexing 215
6.7.1. Singular Value Decomposition 215
6.7.2. Query and Retrieval 218
6.7.3. An Example 219
6.7.4. Discussion 221
6.8. Web Search 222
6.9. Meta-Search: Combining Multiple Rankings 225
6.9.1. Combination Using Similarity Scores 226
6.9.2. Combination Using Rank Positions 227
6.10. Web Spamming 229
6.10.1. Content Spamming 230
6.10.2. Link Spamming 231
6.10.3. Hiding Techniques 233
6.10.4. Combating Spam 234
Bibliographic Notes 235
7. Link Analysis 237
7.1. Social Network Analysis 238
7.1.1. Centrality 238
7.1.2. Prestige 241
7.2. Co-Citation and Bibliographic Coupling 243
7.2.1. Co-Citation 244
7.2.2. Bibliographic Coupling 245
7.3. PageRank 245
7.3.1. PageRank Algorithm 246
7.3.2. Strengths and Weaknesses of PageRank 253
7.3.3. Timed PageRank 254
7.4. HITS 255
7.4.1. HITS Algorithm 256
7.4.2. Finding Other Eigenvectors 259
7.4.3. Relationships with Co-Citation and Bibliographic Coupling 259
7.4.4. Strengths and Weaknesses of HITS 260
7.5. Community Discovery 261
7.5.1. Problem Definition 262
7.5.2. Bipartite Core Communities 264
7.5.3. Maximum Flow Communities 265
7.5.4. Email Communities Based on Betweenness 268
7.5.5. Overlapping Communities of Named Entities 270
Bibliographic Notes 271
8. Web Crawling 273
8.1. A Basic Crawler Algorithm 274
8.1.1. Breadth-First Crawlers 275
8.1.2. Preferential Crawlers 276
8.2. Implementation Issues 277
8.2.1. Fetching 277
8.2.2. Parsing 278
8.2.3. Stopword Removal and Stemming 280
8.2.4. Link Extraction and Canonicalization 280
8.2.5. Spider Traps 282
8.2.6. Page Repository 283
8.2.7. Concurrency 284
8.3. Universal Crawlers 285
8.3.1. Scalability 286
8.3.2. Coverage vs Freshness vs Importance 288
8.4. Focused Crawlers 289
8.5. Topical Crawlers 292
8.5.1. Topical Locality and Cues 294
8.5.2. Best-First Variations 300
8.5.3. Adaptation 303
8.6. Evaluation 310
8.7. Crawler Ethics and Conflicts 315
8.8. Some New Developments 318
Bibliographic Notes 320
9. Structured Data Extraction: Wrapper Generation 323
9.1. Preliminaries 324
9.1.1. Two Types of Data-Rich Pages 324
9.1.2. Data Model 326
9.1.3. HTML Mark-Up Encoding of Data Instances 328
9.2. Wrapper Induction 330
9.2.1. Extraction from a Page 330
9.2.2. Learning Extraction Rules 333
9.2.3. Identifying Informative Examples 337
9.2.4. Wrapper Maintenance 338
9.3. Instance-Based Wrapper Learning 338
9.4. Automatic Wrapper Generation: Problems 341
9.4.1. Two Extraction Problems 342
9.4.2. Patterns as Regular Expressions 343
9.5. String Matching and Tree Matching 344
9.5.1. String Edit Distance 344
9.5.2. Tree Matching 346
9.6. Multiple Alignment 350
9.6.1. Center Star Method 350
9.6.2. Partial Tree Alignment 351
9.7. Building DOM Trees 356
9.8. Extraction Based on a Single List Page: Flat Data Records 357
9.8.1. Two Observations about Data Records 358
9.8.2. Mining Data Regions 359
9.8.3. Identifying Data Records in Data Regions 364
9.8.4. Data Item Alignment and Extraction 365
9.8.5. Making Use of Visual Information 366
9.8.6. Some Other Techniques 366
9.9. Extraction Based on a Single List Page: Nested Data Records 367
9.10. Extraction Based on Multiple Pages 373
9.10.1. Using Techniques in Previous Sections 373
9.10.2. RoadRunner Algorithm 374
9.11. Some Other Issues 375
9.11.1. Extraction from Other Pages 375
9.11.2. Disjunction or Optional 376
9.11.3. A Set Type or a Tuple Type 377
9.11.4. Labeling and Integration 378
9.11.5. Domain-Specific Extraction 378
9.12. Discussion 379
Bibliographic Notes 379
10. Information Integration 381
10.1. Introduction to Schema Matching 382
10.2. Pre-Processing for Schema Matching 384
10.3. Schema-Level Match 385
10.3.1. Linguistic Approaches 385
10.3.2. Constraint-Based Approaches 386
10.4. Domain and Instance-Level Matching 387
10.5. Combining Similarities 390
10.6. 1:m Match 391
10.7. Some Other Issues 392
10.7.1. Reuse of Previous Match Results 392
10.7.2. Matching a Large Number of Schemas 393
10.7.3. Schema Match Results 393
10.7.4. User Interactions 394
10.8. Integration of Web Query Interfaces 394
10.8.1. A Clustering-Based Approach 397
10.8.2. A Correlation-Based Approach 400
10.8.3. An Instance-Based Approach 403
10.9. Constructing a Unified Global Query Interface 406
10.9.1. Structural Appropriateness and the Merge Algorithm 406
10.9.2. Lexical Appropriateness 408
10.9.3. Instance Appropriateness 409
Bibliographic Notes 410
11. Opinion Mining 411
11.1. Sentiment Classification 412
11.1.1. Classification Based on Sentiment Phrases 413
11.1.2. Classification Using Text Classification Methods 415
11.1.3. Classification Using a Score Function 416
11.2. Feature-Based Opinion Mining and Summarization 417
11.2.1. Problem Definition 418
11.2.2. Object Feature Extraction 424
11.2.3. Feature Extraction from Pros and Cons of Format 1 425
11.2.4. Feature Extraction from Reviews of Formats 2 and 3 429
11.2.5. Opinion Orientation Classification 430
11.3. Comparative Sentence and Relation Mining 432
11.3.1. Problem Definition 433
11.3.2. Identification of Gradable Comparative Sentences 435
11.3.3. Extraction of Comparative Relations 437
11.4. Opinion Search 439
11.5. Opinion Spam 441
11.5.1. Objectives and Actions of Opinion Spamming 441
11.5.2. Types of Spam and Spammers 442
11.5.3. Hiding Techniques 443
11.5.4. Spam Detection 444
Bibliographic Notes 446
12. Web Usage Mining 449
12.1. Data Collection and Pre-Processing 450
12.1.1. Sources and Types of Data 452
12.1.2. Key Elements of Web Usage Data Pre-Processing 455
12.2. Data Modeling for Web Usage Mining 462
12.3. Discovery and Analysis of Web Usage Patterns 466
12.3.1. Session and Visitor Analysis 466
12.3.2. Cluster Analysis and Visitor Segmentation 467
12.3.3. Association and Correlation Analysis 471
12.3.4. Analysis of Sequential and Navigational Patterns 475
12.3.5. Classification and Prediction Based on Web User Transactions 479
12.4. Discussion and Outlook 482
Bibliographic Notes 482
References 485
Index 517
any_adam_object	1
any_adam_object_boolean	1
author	Liu, Bing
author_facet	Liu, Bing
author_role	aut
author_sort	Liu, Bing
author_variant	b l bl
building	Verbundindex
bvnumber	BV021759541
classification_rvk	ST 205 ST 530
ctrlnum	(OCoLC)180943345 (DE-599)BVBBV021759541
dewey-full	005.72 006.312
dewey-hundreds	000 - Computer science, information, general works
dewey-ones	005 - Computer programming, programs, data, security 006 - Special computer methods
dewey-raw	005.72 006.312
dewey-search	005.72 006.312
dewey-sort	15.72
dewey-tens	000 - Computer science, information, general works
discipline	Informatik
discipline_str_mv	Informatik
format	Book
fullrecord	<?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>00000nam a2200000 c 4500</leader><controlfield tag="001">BV021759541</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20081023</controlfield><controlfield tag="007">t</controlfield><controlfield tag="008">061009s2007 gw ad\|\| \|\|\|\| 00\|\|\| eng d</controlfield><datafield tag="015" ind1=" " ind2=" "><subfield code="a">06,N36,0032</subfield><subfield code="2">dnb</subfield></datafield><datafield tag="016" ind1="7" ind2=" "><subfield code="a">980843286</subfield><subfield code="2">DE-101</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9783540378815</subfield><subfield code="c">Gb. : ca. EUR 58.80 (freier Pr.), ca. sfr 97.50 (freier Pr.)</subfield><subfield code="9">978-3-540-37881-5</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">3540378812</subfield><subfield code="c">Gb. : ca. EUR 58.80 (freier Pr.), ca. 
sfr 97.50 (freier Pr.)</subfield><subfield code="9">3-540-37881-2</subfield></datafield><datafield tag="024" ind1="3" ind2=" "><subfield code="a">9783540378815</subfield></datafield><datafield tag="028" ind1="5" ind2="2"><subfield code="a">11415190</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)180943345</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)BVBBV021759541</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield><subfield code="e">rakddb</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="044" ind1=" " ind2=" "><subfield code="a">gw</subfield><subfield code="c">XA-DE-BE</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-384</subfield><subfield code="a">DE-824</subfield><subfield code="a">DE-703</subfield><subfield code="a">DE-355</subfield><subfield code="a">DE-19</subfield><subfield code="a">DE-634</subfield><subfield code="a">DE-83</subfield></datafield><datafield tag="082" ind1="0" ind2=" "><subfield code="a">005.72</subfield><subfield code="2">22/ger</subfield></datafield><datafield tag="082" ind1="0" ind2=" "><subfield code="a">006.312</subfield><subfield code="2">22/ger</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 205</subfield><subfield code="0">(DE-625)143613:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 530</subfield><subfield code="0">(DE-625)143679:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">004</subfield><subfield code="2">sdnb</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Liu, Bing</subfield><subfield code="e">Verfasser</subfield><subfield 
code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Web data mining</subfield><subfield code="b">exploring hyperlinks, contents, and usage data</subfield><subfield code="c">Bing Liu</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Berlin [u.a.]</subfield><subfield code="b">Springer</subfield><subfield code="c">2007</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">XIX, 532 S.</subfield><subfield code="b">Ill., graph. Darst.</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="490" ind1="0" ind2=" "><subfield code="a">Data-centric systems and applications</subfield></datafield><datafield tag="500" ind1=" " ind2=" "><subfield code="a">Literaturverz. S. 
[486] - 515</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Data Mining</subfield><subfield code="0">(DE-588)4428654-5</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">World Wide Web</subfield><subfield code="0">(DE-588)4363898-3</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">World Wide Web</subfield><subfield code="0">(DE-588)4363898-3</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="1"><subfield code="a">Data Mining</subfield><subfield code="0">(DE-588)4428654-5</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" "><subfield code="5">DE-604</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="q">text/html</subfield><subfield code="u">http://deposit.dnb.de/cgi-bin/dokserv?id=2844145&prov=M&dok_var=1&dok_ext=htm</subfield><subfield code="3">Inhaltstext</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">HBZ Datenaustausch</subfield><subfield code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=014972637&sequence=000003&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="943" ind1="1" ind2=" "><subfield code="a">oai:aleph.bib-bvb.de:BVB01-014972637</subfield></datafield></record></collection>
id	DE-604.BV021759541
illustrated	Illustrated
index_date	2024-07-02T15:34:42Z
indexdate	2024-07-20T09:08:05Z
institution	BVB
isbn	9783540378815 3540378812
language	English
oai_aleph_id	oai:aleph.bib-bvb.de:BVB01-014972637
oclc_num	180943345
open_access_boolean
owner	DE-384 DE-824 DE-703 DE-355 DE-BY-UBR DE-19 DE-BY-UBM DE-634 DE-83
owner_facet	DE-384 DE-824 DE-703 DE-355 DE-BY-UBR DE-19 DE-BY-UBM DE-634 DE-83
physical	XIX, 532 S. Ill., graph. Darst.
publishDate	2007
publishDateSearch	2007
publishDateSort	2007
publisher	Springer
record_format	marc
series2	Data-centric systems and applications
spelling	Liu, Bing Verfasser aut Web data mining exploring hyperlinks, contents, and usage data Bing Liu Berlin [u.a.] Springer 2007 XIX, 532 S. Ill., graph. Darst. txt rdacontent n rdamedia nc rdacarrier Data-centric systems and applications Literaturverz. S. [486] - 515 Data Mining (DE-588)4428654-5 gnd rswk-swf World Wide Web (DE-588)4363898-3 gnd rswk-swf World Wide Web (DE-588)4363898-3 s Data Mining (DE-588)4428654-5 s DE-604 text/html http://deposit.dnb.de/cgi-bin/dokserv?id=2844145&prov=M&dok_var=1&dok_ext=htm Inhaltstext HBZ Datenaustausch application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=014972637&sequence=000003&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA Inhaltsverzeichnis
spellingShingle	Liu, Bing Web data mining exploring hyperlinks, contents, and usage data Data Mining (DE-588)4428654-5 gnd World Wide Web (DE-588)4363898-3 gnd
subject_GND	(DE-588)4428654-5 (DE-588)4363898-3
title	Web data mining exploring hyperlinks, contents, and usage data
title_auth	Web data mining exploring hyperlinks, contents, and usage data
title_exact_search	Web data mining exploring hyperlinks, contents, and usage data
title_exact_search_txtP	Web data mining exploring hyperlinks, contents, and usage data
title_full	Web data mining exploring hyperlinks, contents, and usage data Bing Liu
title_fullStr	Web data mining exploring hyperlinks, contents, and usage data Bing Liu
title_full_unstemmed	Web data mining exploring hyperlinks, contents, and usage data Bing Liu
title_short	Web data mining
title_sort	web data mining exploring hyperlinks contents and usage data
title_sub	exploring hyperlinks, contents, and usage data
topic	Data Mining (DE-588)4428654-5 gnd World Wide Web (DE-588)4363898-3 gnd
topic_facet	Data Mining World Wide Web
url	http://deposit.dnb.de/cgi-bin/dokserv?id=2844145&prov=M&dok_var=1&dok_ext=htm http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=014972637&sequence=000003&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA
work_keys_str_mv	AT liubing webdataminingexploringhyperlinkscontentsandusagedata

Availability

No print copy is available.

Interlibrary loan: place an order. Note: not in the THWS holdings!