Introduction to data mining

Main authors: Tan, Pang-Ning; Steinbach, Michael; Kumar, Vipin
Format: Book
Language: English
Published: Boston ; Munich [u.a.] : Pearson Addison Wesley, 2006
Edition: Pearson internat. ed.
Subjects: Data Mining; Einführung
Online access: Table of contents; Inhaltsverzeichnis
Description: Includes bibliographical references and indexes
Physical description: XXI, 769 S., graph. Darst.
ISBN: 0321321367; 0321420527

Internal format

MARC
LEADER 00000nam a2200000zc 4500
001    BV021526854
003    DE-604
005    20190228
007    t
008    060327s2006 xxud||| |||| 00||| eng d
010 __ |a 2005008721
020 __ |a 0321321367 |c alk. paper |9 0-321-32136-7
020 __ |a 0321420527 |9 0-321-42052-7
035 __ |a (OCoLC)58729322
035 __ |a (DE-599)BVBBV021526854
040 __ |a DE-604 |b ger |e aacr
041 0_ |a eng
044 __ |a xxu |c US
049 __ |a DE-29T |a DE-858 |a DE-473 |a DE-703 |a DE-91G |a DE-634 |a DE-861 |a DE-M347 |a DE-523 |a DE-20 |a DE-1051
050 _0 |a QA76.9.D343
082 0_ |a 006.312 |2 22
084 __ |a ST 530 |0 (DE-625)143679: |2 rvk
084 __ |a DAT 825f |2 stub
084 __ |a DAT 703f |2 stub
100 1_ |a Tan, Pang-Ning |e Verfasser |0 (DE-588)1121324711 |4 aut
245 10 |a Introduction to data mining |c Pang-Ning Tan, Michael Steinbach, Vipin Kumar
250 __ |a Pearson internat. ed.
264 _1 |a Boston ; Munich [u.a.] |b Pearson Addison Wesley |c 2006
300 __ |a XXI, 769 S. |b graph. Darst.
336 __ |b txt |2 rdacontent
337 __ |b n |2 rdamedia
338 __ |b nc |2 rdacarrier
500 __ |a Includes bibliographical references and indexes
650 _7 |a Data mining |2 gtt
650 _4 |a Exploration de données (Informatique)
650 _7 |a Mineração de dados |2 larpcal
650 _7 |a Recuperação da informação |2 larpcal
650 _4 |a Data mining
650 07 |a Data Mining |0 (DE-588)4428654-5 |2 gnd |9 rswk-swf
655 _7 |8 1\p |0 (DE-588)4151278-9 |a Einführung |2 gnd-content
689 00 |a Data Mining |0 (DE-588)4428654-5 |D s
689 0_ |5 DE-604
700 1_ |a Steinbach, Michael |e Verfasser |0 (DE-588)101724216X |4 aut
700 1_ |a Kumar, Vipin |e Verfasser |4 aut
856 4_ |u http://www.loc.gov/catdir/toc/ecip0510/2005008721.html |3 Table of contents
856 42 |m HBZ Datenaustausch |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=014743268&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis
999 __ |a oai:aleph.bib-bvb.de:BVB01-014743268
883 1_ |8 1\p |a cgwrk |d 20201028 |q DE-101 |u https://d-nb.info/provenance/plan#cgwrk
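The MARC fields above are plain, machine-readable data. As a minimal sketch (not part of the catalogue page itself), the record's MARCXML export could be inspected with the pymarc library; the local filename BV021526854.xml is a hypothetical placeholder for wherever the exported record has been saved.

```python
# Sketch: read this record's MARCXML export with pymarc and print a few fields.
# Assumption: the export has been saved locally as "BV021526854.xml" (placeholder name).
from pymarc import parse_xml_to_array

records = parse_xml_to_array("BV021526854.xml")  # returns a list of pymarc.Record
record = records[0]

# 245 $a / $c: title proper and statement of responsibility
print(record["245"]["a"], "/", record["245"]["c"])

# 100/700: main and added personal-name entries (the three authors)
for field in record.get_fields("100", "700"):
    print("author:", field["a"])

# 020 $a: ISBNs
for field in record.get_fields("020"):
    print("isbn:", field["a"])
```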
Record in the search index

_version_ | 1804135271146455040 |
adam_text | Title: Introduction to data mining
Author: Tan, Pang-Ning
Year: 2006
Contents
Preface vii
1 Introduction 1
1.1 What Is Data Mining?....................... 2
1.2 Motivating Challenges....................... 4
1.3 The Origins of Data Mining.................... 6
1.4 Data Mining Tasks......................... 7
1.5 Scope and Organization of the Book............... 11
1.6 Bibliographic Notes......................... 13
1.7 Exercises .............................. 16
2 Data 19
2.1 Types of Data............................ 22
2.1.1 Attributes and Measurement............... 23
2.1.2 Types of Data Sets..................... 29
2.2 Data Quality............................ 36
2.2.1 Measurement and Data Collection Issues......... 37
2.2.2 Issues Related to Applications .............. 43
2.3 Data Preprocessing......................... 44
2.3.1 Aggregation......................... 45
2.3.2 Sampling.......................... 47
2.3.3 Dimensionality Reduction................. 50
2.3.4 Feature Subset Selection.................. 52
2.3.5 Feature Creation...................... 55
2.3.6 Discretization and Binarization.............. 57
2.3.7 Variable Transformation.................. 63
2.4 Measures of Similarity and Dissimilarity............. 65
2.4.1 Basics............................ 66
2.4.2 Similarity and Dissimilarity between Simple Attributes . 67
2.4.3 Dissimilarities between Data Objects........... 69
2.4.4 Similarities between Data Objects............ 72
2.4.5 Examples of Proximity Measures............. 73
2.4.6 Issues in Proximity Calculation.............. 80
2.4.7 Selecting the Right Proximity Measure.......... 83
2.5 Bibliographic Notes......................... 84
2.6 Exercises .............................. 88
3 Exploring Data 97
3.1 The Iris Data Set.......................... 98
3.2 Summary Statistics......................... 98
3.2.1 Frequencies and the Mode................. 99
3.2.2 Percentiles ......................... 100
3.2.3 Measures of Location: Mean and Median........ 101
3.2.4 Measures of Spread: Range and Variance........ 102
3.2.5 Multivariate Summary Statistics............. 104
3.2.6 Other Ways to Summarize the Data........... 105
3.3 Visualization............................ 105
3.3.1 Motivations for Visualization............... 105
3.3.2 General Concepts...................... 106
3.3.3 Techniques......................... 110
3.3.4 Visualizing Higher-Dimensional Data........... 124
3.3.5 Do's and Don'ts ...................... 130
3.4 OLAP and Multidimensional Data Analysis........... 131
3.4.1 Representing Iris Data as a Multidimensional Array . . 131
3.4.2 Multidimensional Data: The General Case........ 133
3.4.3 Analyzing Multidimensional Data ............ 135
3.4.4 Final Comments on Multidimensional Data Analysis . . 139
3.5 Bibliographic Notes......................... 139
3.6 Exercises .............................. 141
4 Classification:
Basic Concepts, Decision Trees, and Model Evaluation 145
4.1 Preliminaries............................ 146
4.2 General Approach to Solving a Classification Problem..... 148
4.3 Decision Tree Induction...................... 150
4.3.1 How a Decision Tree Works................ 150
4.3.2 How to Build a Decision Tree............... 151
4.3.3 Methods for Expressing Attribute Test Conditions ... 155
4.3.4 Measures for Selecting the Best Split........... 158
4.3.5 Algorithm for Decision Tree Induction.......... 164
4.3.6 An Example: Web Robot Detection........... 166
4.3.7 Characteristics of Decision Tree Induction........ 168
4.4 Model Overfitting.......................... 172
4.4.1 Overfitting Due to Presence of Noise........... 175
4.4.2 Overfitting Due to Lack of Representative Samples ... 177
4.4.3 Overfitting and the Multiple Comparison Procedure . . 178
4.4.4 Estimation of Generalization Errors........... 179
4.4.5 Handling Overfitting in Decision Tree Induction .... 184
4.5 Evaluating the Performance of a Classifier............ 186
4.5.1 Holdout Method...................... 186
4.5.2 Random Subsampling................... 187
4.5.3 Cross-Validation...................... 187
4.5.4 Bootstrap.......................... 188
4.6 Methods for Comparing Classifiers................ 188
4.6.1 Estimating a Confidence Interval for Accuracy..... 189
4.6.2 Comparing the Performance of Two Models....... 191
4.6.3 Comparing the Performance of Two Classifiers..... 192
4.7 Bibliographic Notes......................... 193
4.8 Exercises .............................. 198
5 Classification: Alternative Techniques 207
5.1 Rule-Based Classifier........................ 207
5.1.1 How a Rule-Based Classifier Works............ 209
5.1.2 Rule-Ordering Schemes .................. 211
5.1.3 How to Build a Rule-Based Classifier........... 212
5.1.4 Direct Methods for Rule Extraction........... 213
5.1.5 Indirect Methods for Rule Extraction .......... 221
5.1.6 Characteristics of Rule-Based Classifiers......... 223
5.2 Nearest-Neighbor classifiers.................... 223
5.2.1 Algorithm.......................... 225
5.2.2 Characteristics of Nearest-Neighbor Classifiers..... 226
5.3 Bayesian Classifiers......................... 227
5.3.1 Bayes Theorem....................... 228
5.3.2 Using the Bayes Theorem for Classification....... 229
5.3.3 Naive Bayes Classifier................... 231
5.3.4 Bayes Error Rate...................... 238
5.3.5 Bayesian Belief Networks................. 240
5.4 Artificial Neural Network (ANN)................. 246
5.4.1 Perceptron ......................... 247
5.4.2 Multilayer Artificial Neural Network........... 251
5.4.3 Characteristics of ANN.................. 255
5.5 Support Vector Machine (SVM).................. 256
5.5.1 Maximum Margin Hyperplanes.............. 256
5.5.2 Linear SVM: Separable Case............... 259
5.5.3 Linear SVM: Nonseparable Case............. 266
5.5.4 Nonlinear SVM....................... 270
5.5.5 Characteristics of SVM .................. 276
5.6 Ensemble Methods......................... 276
5.6.1 Rationale for Ensemble Method.............. 277
5.6.2 Methods for Constructing an Ensemble Classifier .... 278
5.6.3 Bias-Variance Decomposition............... 281
5.6.4 Bagging........................... 283
5.6.5 Boosting........................... 285
5.6.6 Random Forests ...................... 290
5.6.7 Empirical Comparison among Ensemble Methods .... 294
5.7 Class Imbalance Problem..................... 294
5.7.1 Alternative Metrics..................... 295
5.7.2 The Receiver Operating Characteristic Curve...... 298
5.7.3 Cost-Sensitive Learning.................. 302
5.7.4 Sampling-Based Approaches................ 305
5.8 Multiclass Problem......................... 306
5.9 Bibliographic Notes......................... 309
5.10 Exercises .............................. 315
6 Association Analysis: Basic Concepts and Algorithms 327
6.1 Problem Definition......................... 328
6.2 Frequent Itemset Generation ................... 332
6.2.1 The Apriori Principle................... 333
6.2.2 Frequent Itemset Generation in the Apriori Algorithm . 335
6.2.3 Candidate Generation and Pruning............ 338
6.2.4 Support Counting..................... 342
6.2.5 Computational Complexity................ 345
6.3 Rule Generation.......................... 349
6.3.1 Confidence-Based Pruning................. 350
6.3.2 Rule Generation in Apriori Algorithm.......... 350
6.3.3 An Example: Congressional Voting Records....... 352
6.4 Compact Representation of Frequent Itemsets.......... 353
6.4.1 Maximal Frequent Itemsets................ 354
6.4.2 Closed Frequent Itemsets................. 355
6.5 Alternative Methods for Generating Frequent Itemsets..... 359
6.6 FP-Growth Algorithm....................... 363
6.6.1 FP-Tree Representation.................. 363
6.6.2 Frequent Itemset Generation in FP-Growth Algorithm . 366
6.7 Evaluation of Association Patterns................ 370
6.7.1 Objective Measures of Interestingness.......... 371
6.7.2 Measures beyond Pairs of Binary Variables....... 382
6.7.3 Simpson's Paradox..................... 384
6.8 Effect of Skewed Support Distribution.............. 386
6.9 Bibliographic Notes......................... 390
6.10 Exercises .............................. 404
7 Association Analysis: Advanced Concepts 415
7.1 Handling Categorical Attributes ................. 415
7.2 Handling Continuous Attributes ................. 418
7.2.1 Discretization-Based Methods............... 418
7.2.2 Statistics-Based Methods................. 422
7.2.3 Non-discretization Methods................ 424
7.3 Handling a Concept Hierarchy .................. 426
7.4 Sequential Patterns......................... 429
7.4.1 Problem Formulation ................... 429
7.4.2 Sequential Pattern Discovery............... 431
7.4.3 Timing Constraints..................... 436
7.4.4 Alternative Counting Schemes .............. 439
7.5 Subgraph Patterns......................... 442
7.5.1 Graphs and Subgraphs................... 443
7.5.2 Frequent Subgraph Mining................ 444
7.5.3 Apriori-like Method.................... 447
7.5.4 Candidate Generation................... 448
7.5.5 Candidate Pruning..................... 453
7.5.6 Support Counting..................... 457
7.6 Infrequent Patterns......................... 457
7.6.1 Negative Patterns ..................... 458
7.6.2 Negatively Correlated Patterns.............. 458
7.6.3 Comparisons among Infrequent Patterns, Negative Pat-
terns, and Negatively Correlated Patterns........ 460
7.6.4 Techniques for Mining Interesting Infrequent Patterns . 461
7.6.5 Techniques Based on Mining Negative Patterns..... 463
7.6.6 Techniques Based on Support Expectation........ 465
7.7 Bibliographic Notes......................... 469
7.8 Exercises .............................. 473
8 Cluster Analysis: Basic Concepts and Algorithms 487
8.1 Overview.............................. 490
8.1.1 What Is Cluster Analysis?................. 490
8.1.2 Different Types of Clusterings............... 491
8.1.3 Different Types of Clusters................ 493
8.2 K-means............................... 496
8.2.1 The Basic K-means Algorithm.............. 497
8.2.2 K-means: Additional Issues................ 506
8.2.3 Bisecting K-means..................... 508
8.2.4 K-means and Different Types of Clusters........ 510
8.2.5 Strengths and Weaknesses................. 510
8.2.6 K-means as an Optimization Problem.......... 513
8.3 Agglomerative Hierarchical Clustering.............. 515
8.3.1 Basic Agglomerative Hierarchical Clustering Algorithm 516
8.3.2 Specific Techniques..................... 518
8.3.3 The Lance-Williams Formula for Cluster Proximity . . . 524
8.3.4 Key Issues in Hierarchical Clustering........... 524
8.3.5 Strengths and Weaknesses................. 526
8.4 DBSCAN.............................. 526
8.4.1 Traditional Density: Center-Based Approach...... 527
8.4.2 The DBSCAN Algorithm................. 528
8.4.3 Strengths and Weaknesses................. 530
8.5 Cluster Evaluation......................... 532
8.5.1 Overview.......................... 533
8.5.2 Unsupervised Cluster Evaluation Using Cohesion and
Separation ......................... 536
8.5.3 Unsupervised Cluster Evaluation Using the Proximity
Matrix............................ 542
8.5.4 Unsupervised Evaluation of Hierarchical Clustering . . . 544
8.5.5 Determining the Correct Number of Clusters...... 546
8.5.6 Clustering Tendency.................... 547
8.5.7 Supervised Measures of Cluster Validity......... 548
8.5.8 Assessing the Significance of Cluster Validity Measures . 553
8.6 Bibliographic Notes......................... 555
8.7 Exercises .............................. 559
9 Cluster Analysis: Additional Issues and Algorithms 569
9.1 Characteristics of Data, Clusters, and Clustering Algorithms . 570
9.1.1 Example: Comparing K-means and DBSCAN...... 570
9.1.2 Data Characteristics.................... 571
9.1.3 Cluster Characteristics................... 573
9.1.4 General Characteristics of Clustering Algorithms .... 575
9.2 Prototype-Based Clustering.................... 577
9.2.1 Fuzzy Clustering...................... 577
9.2.2 Clustering Using Mixture Models............. 583
9.2.3 Self-Organizing Maps (SOM)............... 594
9.3 Density-Based Clustering..................... 600
9.3.1 Grid-Based Clustering................... 601
9.3.2 Subspace Clustering.................... 604
9.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based
Clustering.......................... 608
9.4 Graph-Based Clustering...................... 612
9.4.1 Sparsification........................ 613
9.4.2 Minimum Spanning Tree (MST) Clustering....... 614
9.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities
Using METIS........................ 616
9.4.4 Chameleon: Hierarchical Clustering with Dynamic
Modeling.......................... 616
9.4.5 Shared Nearest Neighbor Similarity ........... 622
9.4.6 The Jarvis-Patrick Clustering Algorithm......... 625
9.4.7 SNN Density........................ 627
9.4.8 SNN Density-Based Clustering.............. 629
9.5 Scalable Clustering Algorithms.................. 630
9.5.1 Scalability: General Issues and Approaches....... 630
9.5.2 BIRCH........................... 633
9.5.3 CURE............................ 635
9.6 Which Clustering Algorithm?................... 639
9.7 Bibliographic Notes......................... 643
9.8 Exercises .............................. 647
10 Anomaly Detection 651
10.1 Preliminaries............................ 653
10.1.1 Causes of Anomalies.................... 653
10.1.2 Approaches to Anomaly Detection............ 654
10.1.3 The Use of Class Labels.................. 655
10.1.4 Issues............................ 656
10.2 Statistical Approaches....................... 658
10.2.1 Detecting Outliers in a Univariate Normal Distribution 659
10.2.2 Outliers in a Multivariate Normal Distribution..... 661
10.2.3 A Mixture Model Approach for Anomaly Detection . . . 662
10.2.4 Strengths and Weaknesses................. 665
10.3 Proximity-Based Outlier Detection................ 666
10.3.1 Strengths and Weaknesses................. 666
10.4 Density-Based Outlier Detection................. 668
10.4.1 Detection of Outliers Using Relative Density...... 669
10.4.2 Strengths and Weaknesses................. 670
10.5 Clustering-Based Techniques ................... 671
10.5.1 Assessing the Extent to Which an Object Belongs to a
Cluster ........................... 672
10.5.2 Impact of Outliers on the Initial Clustering....... 674
10.5.3 The Number of Clusters to Use.............. 674
10.5.4 Strengths and Weaknesses................. 674
10.6 Bibliographic Notes......................... 675
10.7 Exercises .............................. 680
Appendix A Linear Algebra 685
A.1 Vectors ............................... 685
A.1.1 Definition.......................... 685
A.1.2 Vector Addition and Multiplication by a Scalar..... 685
A.1.3 Vector Spaces........................ 687
A.1.4 The Dot Product, Orthogonality, and Orthogonal
Projections......................... 688
A.1.5 Vectors and Data Analysis ................ 690
A.2 Matrices............................... 691
A.2.1 Matrices: Definitions.................... 691
A.2.2 Matrices: Addition and Multiplication by a Scalar . . . 692
A.2.3 Matrices: Multiplication.................. 693
A.2.4 Linear Transformations and Inverse Matrices...... 695
A.2.5 Eigenvalue and Singular Value Decomposition...... 697
A.2.6 Matrices and Data Analysis................ 699
A.3 Bibliographic Notes......................... 700
Appendix B Dimensionality Reduction 701
B.1 PCA and SVD........................... 701
B.1.1 Principal Components Analysis (PCA).......... 701
B.1.2 SVD............................. 706
B.2 Other Dimensionality Reduction Techniques........... 708
B.2.1 Factor Analysis....................... 708
B.2.2 Locally Linear Embedding (LLE)............. 710
B.2.3 Multidimensional Scaling, FastMap, and ISOMAP ... 712
B.2.4 Common Issues.......................715
B.3 Bibliographic Notes.........................716
Appendix C Probability and Statistics 719
C.1 Probability............................. 719
C.1.1 Expected Values...................... 722
C.2 Statistics .............................. 723
C.2.1 Point Estimation...................... 724
C.2.2 Central Limit Theorem.................. 724
C.2.3 Interval Estimation..................... 725
C.3 Hypothesis Testing......................... 726
Appendix D Regression 729
D.1 Preliminaries............................ 729
D.2 Simple Linear Regression ..................... 730
D.2.1 Least Square Method ................... 731
D.2.2 Analyzing Regression Errors ............... 733
D.2.3 Analyzing Goodness of Fit ................ 735
D.3 Multivariate Linear Regression.................. 736
D.4 Alternative Least-Square Regression Methods.......... 737
Appendix E Optimization 739
E.1 Unconstrained Optimization....................739
E.1.1 Numerical Methods ....................742
E.2 Constrained Optimization.....................746
E.2.1 Equality Constraints....................746
E.2.2 Inequality Constraints...................747
Author Index 750
Subject Index 758
Copyright Permissions 769
|
any_adam_object | 1 |
any_adam_object_boolean | 1 |
author | Tan, Pang-Ning Steinbach, Michael Kumar, Vipin |
author_GND | (DE-588)1121324711 (DE-588)101724216X |
author_facet | Tan, Pang-Ning Steinbach, Michael Kumar, Vipin |
author_role | aut aut aut |
author_sort | Tan, Pang-Ning |
author_variant | p n t pnt m s ms v k vk |
building | Verbundindex |
bvnumber | BV021526854 |
callnumber-first | Q - Science |
callnumber-label | QA76 |
callnumber-raw | QA76.9.D343 |
callnumber-search | QA76.9.D343 |
callnumber-sort | QA 276.9 D343 |
callnumber-subject | QA - Mathematics |
classification_rvk | ST 530 |
classification_tum | DAT 825f DAT 703f |
ctrlnum | (OCoLC)58729322 (DE-599)BVBBV021526854 |
dewey-full | 006.312 |
dewey-hundreds | 000 - Computer science, information, general works |
dewey-ones | 006 - Special computer methods |
dewey-raw | 006.312 |
dewey-search | 006.312 |
dewey-sort | 16.312 |
dewey-tens | 000 - Computer science, information, general works |
discipline | Informatik |
discipline_str_mv | Informatik |
edition | Pearson internat. ed. |
format | Book |
genre | 1\p (DE-588)4151278-9 Einführung gnd-content |
genre_facet | Einführung |
id | DE-604.BV021526854 |
illustrated | Illustrated |
index_date | 2024-07-02T14:24:11Z |
indexdate | 2024-07-09T20:37:51Z |
institution | BVB |
isbn | 0321321367 0321420527 |
language | English |
lccn | 2005008721 |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-014743268 |
oclc_num | 58729322 |
open_access_boolean | |
owner | DE-29T DE-858 DE-473 DE-BY-UBG DE-703 DE-91G DE-BY-TUM DE-634 DE-861 DE-M347 DE-523 DE-20 DE-1051 |
owner_facet | DE-29T DE-858 DE-473 DE-BY-UBG DE-703 DE-91G DE-BY-TUM DE-634 DE-861 DE-M347 DE-523 DE-20 DE-1051 |
physical | XXI, 769 S. graph. Darst. |
publishDate | 2006 |
publishDateSearch | 2006 |
publishDateSort | 2006 |
publisher | Pearson Addison Wesley |
record_format | marc |
spelling | Tan, Pang-Ning Verfasser (DE-588)1121324711 aut Introduction to data mining Pang-Ning Tan, Michael Steinbach, Vipin Kumar Pearson internat. ed. Boston ; Munich [u.a.] Pearson Addison Wesley 2006 XXI, 769 S. graph. Darst. txt rdacontent n rdamedia nc rdacarrier Includes bibliographical references and indexes Data mining gtt Exploration de données (Informatique) Mineração de dados larpcal Recuperação da informação larpcal Data mining Data Mining (DE-588)4428654-5 gnd rswk-swf 1\p (DE-588)4151278-9 Einführung gnd-content Data Mining (DE-588)4428654-5 s DE-604 Steinbach, Michael Verfasser (DE-588)101724216X aut Kumar, Vipin Verfasser aut http://www.loc.gov/catdir/toc/ecip0510/2005008721.html Table of contents HBZ Datenaustausch application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=014743268&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA Inhaltsverzeichnis 1\p cgwrk 20201028 DE-101 https://d-nb.info/provenance/plan#cgwrk |
spellingShingle | Tan, Pang-Ning Steinbach, Michael Kumar, Vipin Introduction to data mining Data mining gtt Exploration de données (Informatique) Mineração de dados larpcal Recuperação da informação larpcal Data mining Data Mining (DE-588)4428654-5 gnd |
subject_GND | (DE-588)4428654-5 (DE-588)4151278-9 |
title | Introduction to data mining |
title_auth | Introduction to data mining |
title_exact_search | Introduction to data mining |
title_exact_search_txtP | Introduction to data mining |
title_full | Introduction to data mining Pang-Ning Tan, Michael Steinbach, Vipin Kumar |
title_fullStr | Introduction to data mining Pang-Ning Tan, Michael Steinbach, Vipin Kumar |
title_full_unstemmed | Introduction to data mining Pang-Ning Tan, Michael Steinbach, Vipin Kumar |
title_short | Introduction to data mining |
title_sort | introduction to data mining |
topic | Data mining gtt Exploration de données (Informatique) Mineração de dados larpcal Recuperação da informação larpcal Data mining Data Mining (DE-588)4428654-5 gnd |
topic_facet | Data mining Exploration de données (Informatique) Mineração de dados Recuperação da informação Data Mining Einführung |
url | http://www.loc.gov/catdir/toc/ecip0510/2005008721.html http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=014743268&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
work_keys_str_mv | AT tanpangning introductiontodatamining AT steinbachmichael introductiontodatamining AT kumarvipin introductiontodatamining |