Data mining and predictive analytics:
Saved in:
Main authors: | Larose, Daniel T. ; Larose, Chantal D. |
---|---|
Format: | Book |
Language: | English |
Published: |
Hoboken, NJ
Wiley
2015
|
Edition: | 2. ed. |
Series: | Wiley series on methods and applications in data mining
|
Subjects: | Data Mining ; Vorhersagetheorie |
Online access: | Table of contents |
Description: | Also announced under the title: Data mining methods and models |
Description: | XXIX, 794 p., ill., graphs |
ISBN: | 9781118116197 |
Internal format
MARC
LEADER | 00000nam a22000002c 4500 | ||
---|---|---|---|
001 | BV042017500 | ||
003 | DE-604 | ||
005 | 20190417 | ||
007 | t | ||
008 | 140808s2015 ad|| |||| 00||| eng d | ||
020 | |a 9781118116197 |c hbk. |9 978-1-118-11619-7 | ||
035 | |a (OCoLC)908652508 | ||
035 | |a (DE-599)BSZ407445226 | ||
040 | |a DE-604 |b ger | ||
041 | 0 | |a eng | |
049 | |a DE-91G |a DE-1050 |a DE-1049 |a DE-29T |a DE-384 |a DE-91 |a DE-862 |a DE-M382 |a DE-859 |a DE-706 |a DE-945 |a DE-573 |a DE-11 |a DE-Aug4 |a DE-739 | ||
082 | 0 | |a 006.3/12 | |
084 | |a ST 530 |0 (DE-625)143679: |2 rvk | ||
084 | |a DAT 703f |2 stub | ||
100 | 1 | |a Larose, Daniel T. |e Verfasser |0 (DE-588)1062529189 |4 aut | |
245 | 1 | 0 | |a Data mining and predictive analytics |c Daniel T. Larose ; Chantal D. Larose |
250 | |a 2. ed. | ||
264 | 1 | |a Hoboken, NJ |b Wiley |c 2015 | |
300 | |a XXIX, 794 S. |b Ill., graph. Darst. | ||
336 | |b txt |2 rdacontent | ||
337 | |b n |2 rdamedia | ||
338 | |b nc |2 rdacarrier | ||
490 | 0 | |a Wiley series on methods and applications in data mining | |
500 | |a Auch angekündigt u.d.T.: Data mining methods and models | ||
650 | 0 | 7 | |a Data Mining |0 (DE-588)4428654-5 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Vorhersagetheorie |0 (DE-588)4188671-9 |2 gnd |9 rswk-swf |
689 | 0 | 0 | |a Data Mining |0 (DE-588)4428654-5 |D s |
689 | 0 | 1 | |a Vorhersagetheorie |0 (DE-588)4188671-9 |D s |
689 | 0 | |5 DE-604 | |
700 | 1 | |a Larose, Chantal D. |e Verfasser |0 (DE-588)1062529227 |4 aut | |
856 | 4 | 2 | |m HBZ Datenaustausch |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=027459262&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis |
999 | |a oai:aleph.bib-bvb.de:BVB01-027459262 |
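In the MARC display above, each field body is a run of subfields, with `|x` introducing subfield code x (for example, `|a` carries the main value of field 245 and `|c` the statement of responsibility). As an illustrative sketch only — this helper and its name are not part of the record or of any MARC library — the pipe-delimited subfield syntax shown here can be split apart like this:

```python
# Sketch: split a displayed MARC field body such as
#   "|a Data mining and predictive analytics |c Daniel T. Larose ; Chantal D. Larose"
# into (subfield code, value) pairs. This parses only the display syntax
# used on this page, not binary MARC or MARCXML.
def parse_subfields(field_body: str) -> list[tuple[str, str]]:
    subfields = []
    for part in field_body.split("|"):
        part = part.strip()
        if not part:
            continue  # skip the empty chunk before the first "|"
        code, _, value = part.partition(" ")
        subfields.append((code, value.strip()))
    return subfields

title_245 = "|a Data mining and predictive analytics |c Daniel T. Larose ; Chantal D. Larose"
print(parse_subfields(title_245))
# → [('a', 'Data mining and predictive analytics'), ('c', 'Daniel T. Larose ; Chantal D. Larose')]
```

The same call works on any field body above, e.g. the 020 line yields the ISBN under code `a` and the binding note under code `c`.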
Record in the search index
DE-BY-862_location | 2000 |
---|---|
DE-BY-FWS_call_number | 2000/ST 530 L331(2) |
DE-BY-FWS_katkey | 579497 |
DE-BY-FWS_media_number | 083000513404 |
_version_ | 1816933555001884672 |
adam_text | Titel: Data mining and predictive analytics
Autor: Larose, Daniel T
Jahr: 2015
CONTENTS PREFACE ACKNOWLEDGMENTS PART I DATA PREPARATION 1 CHAPTER 1 AN INTRODUCTION TO DATA MINING AND PREDICTIVE ANALYTICS 3 1.1 What is Data Mining? What is Predictive Analytics? 3 1.2 Wanted: Data Miners 5 1.3 The Need for Human Direction of Data Mining 6 1.4 The Cross-Industry Standard Process for Data Mining: CRISP-DM 6 1.4.1 CRISP-DM: The Six Phases 7 1.5 Fallacies of Data Mining 9 1.6 What Tasks Can Data Mining Accomplish 10 1.6.1 Description 10 1.6.2 Estimation 11 1.6.3 Prediction 12 1.6.4 Classification 13 1.6.5 Clustering 15 1.6.6 Association 16 The R Zone 17 R References 18 Exercises 18 CHAPTER 2 DATA PREPROCESSING 20 2.1 Why do We Need to Preprocess the Data? 20 2.2 Data Cleaning 21 2.3 Handling Missing Data 22 2.4 Identifying Misclassifications 25 2.5 Graphical Methods for Identifying Outliers 26 2.6 Measures of Center and Spread 27 2.7 Data Transformation 30 2.8 Min-Max Normalization 30 2.9 Z-Score Standardization 31 2.10 Decimal Scaling 32 2.11 Transformations to Achieve Normality 32
2.12 Numerical Methods for Identifying Outliers 38 2.13 Flag Variables 39 2.14 Transforming Categorical Variables into Numerical Variables 40 2.15 Binning Numerical Variables 41 2.16 Reclassifying Categorical Variables 42 2.17 Adding an Index Field 43 2.18 Removing Variables that are not Useful 43 2.19 Variables that Should Probably not be Removed 43 2.20 Removal of Duplicate Records 44 2.21 A Word About ID Fields 45 The R Zone 45 R Reference 51 Exercises 51 CHAPTER 3 EXPLORATORY DATA ANALYSIS 54 3.1 Hypothesis Testing Versus Exploratory Data Analysis 54 3.2 Getting to Know the Data Set 54 3.3 Exploring Categorical Variables 56 3.4 Exploring Numeric Variables 64 3.5 Exploring Multivariate Relationships 69 3.6 Selecting Interesting Subsets of the Data for Further Investigation 70 3.7 Using EDA to Uncover Anomalous Fields 71 3.8 Binning Based on Predictive Value 72 3.9 Deriving New Variables: Flag Variables 75 3.10 Deriving New Variables: Numerical Variables 77 3.11 Using EDA to Investigate Correlated Predictor Variables 78 3.12 Summary of Our EDA 81 The R Zone 82 R References 89 Exercises 89 CHAPTER 4 DIMENSION-REDUCTION METHODS 92 4.1 Need for Dimension-Reduction in Data Mining 92 4.2 Principal Components Analysis 93 4.3 Applying PCA to the Houses Data Set 96 4.4 How Many Components Should We Extract? 102 4.4.1 The Eigenvalue Criterion 102 4.4.2 The Proportion of Variance Explained Criterion 103 4.4.3 The Minimum Communality Criterion 103 4.4.4 The Scree Plot Criterion 103 4.5 Profiling the Principal Components 105 4.6 Communalities 108 4.6.1 Minimum Communality Criterion 109 4.7 Validation of the Principal Components 110 4.8 Factor Analysis 110 4.9 Applying Factor Analysis to the Adult Data Set 111 4.10 Factor Rotation 114 4.11 User-Defined Composites 117
4.12 An Example of a User-Defined Composite 118 The R Zone 119 R References 124 Exercises 124 PART II STATISTICAL ANALYSIS 129 CHAPTER 5 UNIVARIATE STATISTICAL ANALYSIS 131 5.1 Data Mining Tasks in Discovering Knowledge in Data 131 5.2 Statistical Approaches to Estimation and Prediction 131 5.3 Statistical Inference 132 5.4 How Confident are We in Our Estimates? 133 5.5 Confidence Interval Estimation of the Mean 134 5.6 How to Reduce the Margin of Error 136 5.7 Confidence Interval Estimation of the Proportion 137 5.8 Hypothesis Testing for the Mean 138 5.9 Assessing the Strength of Evidence Against the Null Hypothesis 140 5.10 Using Confidence Intervals to Perform Hypothesis Tests 141 5.11 Hypothesis Testing for the Proportion 143 Reference 144 The R Zone 144 R Reference 145 Exercises 145 CHAPTER 6 MULTIVARIATE STATISTICS 148 6.1 Two-Sample t-Test for Difference in Means 148 6.2 Two-Sample Z-Test for Difference in Proportions 149 6.3 Test for the Homogeneity of Proportions 150 6.4 Chi-Square Test for Goodness of Fit of Multinomial Data 152 6.5 Analysis of Variance 153 Reference 156 The R Zone 157 R Reference 158 Exercises 158 CHAPTER 7 PREPARING TO MODEL THE DATA 160 7.1 Supervised Versus Unsupervised Methods 160 7.2 Statistical Methodology and Data Mining Methodology 161 7.3 Cross-Validation 161 7.4 Overfitting 163 7.5 Bias-Variance Trade-Off 164 7.6 Balancing the Training Data Set 166 7.7 Establishing Baseline Performance 167 The R Zone 168
R Reference 169 Exercises 169 CHAPTER 8 SIMPLE LINEAR REGRESSION 171 8.1 An Example of Simple Linear Regression 171 8.1.1 The Least-Squares Estimates 174 8.2 Dangers of Extrapolation 177 8.3 How Useful is the Regression? The Coefficient of Determination, r² 178 8.4 Standard Error of the Estimate, s 183 8.5 Correlation Coefficient r 184 8.6 ANOVA Table for Simple Linear Regression 186 8.7 Outliers, High Leverage Points, and Influential Observations 186 8.8 Population Regression Equation 195 8.9 Verifying the Regression Assumptions 198 8.10 Inference in Regression 203 8.11 t-Test for the Relationship Between x and y 204 8.12 Confidence Interval for the Slope of the Regression Line 206 8.13 Confidence Interval for the Correlation Coefficient ρ 208 8.14 Confidence Interval for the Mean Value of y Given x 210 8.15 Prediction Interval for a Randomly Chosen Value of y Given x 211 8.16 Transformations to Achieve Linearity 213 8.17 Box-Cox Transformations 220 The R Zone 220 R References
227 Exercises 227 CHAPTER 9 MULTIPLE REGRESSION AND MODEL BUILDING 236 9.1 An Example of Multiple Regression 236 9.2 The Population Multiple Regression Equation 242 9.3 Inference in Multiple Regression 243 9.3.1 The t-Test for the Relationship Between y and x_i 243 9.3.2 t-Test for Relationship Between Nutritional Rating and Sugars 244 9.3.3 t-Test for Relationship Between Nutritional Rating and Fiber Content 244 9.3.4 The F-Test for the Significance of the Overall Regression Model 245 9.3.5 F-Test for Relationship between Nutritional Rating and {Sugar and Fiber}, Taken Together 247 9.3.6 The Confidence Interval for a Particular Coefficient, β_i 247 9.3.7 The Confidence Interval for the Mean Value of y, Given x_1, x_2, ..., x_m 248 9.3.8 The Prediction Interval for a Randomly Chosen Value of y, Given x_1, x_2, ..., x_m 248 9.4 Regression with Categorical Predictors, Using Indicator Variables 249 9.5 Adjusting R²: Penalizing Models for Including Predictors that are not Useful 256 9.6 Sequential
Sums of Squares 257 9.7 Multicollinearity 258 9.8 Variable Selection Methods 266 9.8.1 The Partial F-Test 266
9.8.2 The Forward Selection Procedure 268 9.8.3 The Backward Elimination Procedure 268 9.8.4 The Stepwise Procedure 268 9.8.5 The Best Subsets Procedure 269 9.8.6 The All-Possible-Subsets Procedure 269 9.9 Gas Mileage Data Set 270 9.10 An Application of Variable Selection Methods 271 9.10.1 Forward Selection Procedure Applied to the Gas Mileage Data Set 271 9.10.2 Backward Elimination Procedure Applied to the Gas Mileage Data Set 273 9.10.3 The Stepwise Selection Procedure Applied to the Gas Mileage Data Set 273 9.10.4 Best Subsets Procedure Applied to the Gas Mileage Data Set 274 9.10.5 Mallows' Cp Statistic 275 9.11 Using the Principal Components as Predictors in Multiple Regression 279 The R Zone 284 R References 292 Exercises 293 PART III CLASSIFICATION 299 CHAPTER 10 k-NEAREST NEIGHBOR ALGORITHM 301 10.1 Classification Task 301 10.2 k-Nearest Neighbor Algorithm 302 10.3 Distance Function 305 10.4 Combination Function 307 10.4.1 Simple Unweighted Voting 307 10.4.2 Weighted Voting 308 10.5 Quantifying Attribute Relevance: Stretching the Axes 309 10.6 Database Considerations 310 10.7 k-Nearest Neighbor Algorithm for Estimation and Prediction 310 10.8 Choosing k 311 10.9 Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler 312 The R Zone 312 R References 315 Exercises 315 CHAPTER 11 DECISION TREES 317 11.1 What is a Decision Tree? 317 11.2 Requirements for Using Decision Trees 319 11.3 Classification and Regression Trees 319 11.4 C4.5 Algorithm 326 11.5 Decision Rules 332 11.6 Comparison of the C5.0 and CART Algorithms Applied to Real Data 332 The R Zone 335
R References 337 Exercises 337 CHAPTER 12 NEURAL NETWORKS 339 12.1 Input and Output Encoding 339 12.2 Neural Networks for Estimation and Prediction 342 12.3 Simple Example of a Neural Network 342 12.4 Sigmoid Activation Function 344 12.5 Back-Propagation 345 12.6 Gradient-Descent Method 346 12.7 Back-Propagation Rules 347 12.8 Example of Back-Propagation 347 12.9 Termination Criteria 349 12.10 Learning Rate 350 12.11 Momentum Term 351 12.12 Sensitivity Analysis 353 12.13 Application of Neural Network Modeling 353 The R Zone 356 R References 357 Exercises 357 CHAPTER 13 LOGISTIC REGRESSION 359 13.1 Simple Example of Logistic Regression 359 13.2 Maximum Likelihood Estimation 361 13.3 Interpreting Logistic Regression Output 362 13.4 Inference: are the Predictors Significant? 363 13.5 Odds Ratio and Relative Risk 365 13.6 Interpreting Logistic Regression for a Dichotomous Predictor 367 13.7 Interpreting Logistic Regression for a Polychotomous Predictor 370 13.8 Interpreting Logistic Regression for a Continuous Predictor 374 13.9 Assumption of Linearity 378 13.10 Zero-Cell Problem 382 13.11 Multiple Logistic Regression 384 13.12 Introducing Higher Order Terms to Handle Nonlinearity 388 13.13 Validating the Logistic Regression Model 395 13.14 WEKA: Hands-On Analysis Using Logistic Regression 399 The R Zone 404 R References 409 Exercises 409 CHAPTER 14 NAÏVE BAYES AND BAYESIAN NETWORKS 414 14.1 Bayesian Approach 414 14.2 Maximum a Posteriori (MAP) Classification 416 14.3 Posterior Odds Ratio 420
14.4 Balancing the Data 422 14.5 Naïve Bayes Classification 423 14.6 Interpreting the Log Posterior Odds Ratio 426 14.7 Zero-Cell Problem 428 14.8 Numeric Predictors for Naïve Bayes Classification 429 14.9 WEKA: Hands-on Analysis Using Naïve Bayes 432 14.10 Bayesian Belief Networks 436 14.11 Clothing Purchase Example 436 14.12 Using the Bayesian Network to Find Probabilities 439 14.12.1 WEKA: Hands-on Analysis Using Bayes Net 441 The R Zone 444 R References 448 Exercises 448 CHAPTER 15 MODEL EVALUATION TECHNIQUES 451 15.1 Model Evaluation Techniques for the Description Task 451 15.2 Model Evaluation Techniques for the Estimation and Prediction Tasks 452 15.3 Model Evaluation Measures for the Classification Task 454 15.4 Accuracy and Overall Error Rate 456 15.5 Sensitivity and Specificity 457 15.6 False-Positive Rate and False-Negative Rate 458 15.7 Proportions of True Positives, True Negatives, False Positives, and False Negatives 458 15.8 Misclassification Cost Adjustment to Reflect Real-World Concerns 460 15.9 Decision Cost/Benefit Analysis 462 15.10 Lift Charts and Gains Charts 463 15.11 Interweaving Model Evaluation with Model Building 466 15.12 Confluence of Results: Applying a Suite of Models 466 The R Zone 467 R References 468 Exercises 468 CHAPTER 16 COST-BENEFIT ANALYSIS USING DATA-DRIVEN COSTS 471 16.1 Decision Invariance Under Row Adjustment 471 16.2 Positive Classification Criterion 473 16.3 Demonstration of the Positive Classification Criterion 474 16.4 Constructing the Cost Matrix 474 16.5 Decision Invariance Under Scaling 476 16.6 Direct Costs and Opportunity Costs 478 16.7 Case Study: Cost-Benefit Analysis Using Data-Driven Misclassification Costs 478 16.8 Rebalancing as a Surrogate for Misclassification Costs 483 The R Zone 485 R References 487 Exercises 487
CHAPTER 17 COST-BENEFIT ANALYSIS FOR TRINARY AND k-NARY CLASSIFICATION MODELS 491 17.1 Classification Evaluation Measures for a Generic Trinary Target 491 17.2 Application of Evaluation Measures for Trinary Classification to the Loan Approval Problem 17.3 Data-Driven Cost-Benefit Analysis for Trinary Loan Classification Problem 498 17.4 Comparing CART Models with and without Data-Driven Misclassification Costs 17.5 Classification Evaluation Measures for a Generic k-Nary Target 17.6 Example of Evaluation Measures and Data-Driven Misclassification Costs for k-Nary Classification The R Zone R References Exercises CHAPTER 18 GRAPHICAL EVALUATION OF CLASSIFICATION MODELS 18.1 Review of Lift Charts and Gains Charts 18.2 Lift Charts and Gains Charts Using Misclassification Costs 18.3 Response Charts 18.4 Profits Charts 18.5 Return on Investment (ROI) Charts The R Zone R References Exercises PART IV CLUSTERING CHAPTER 19 HIERARCHICAL AND k-MEANS CLUSTERING 19.1 The Clustering Task 19.2 Hierarchical Clustering Methods 19.3 Single-Linkage Clustering 19.4 Complete-Linkage Clustering 19.5 k-Means Clustering 19.6 Example of k-Means Clustering at Work 19.7 Application of k-Means Clustering Using SAS Enterprise Miner 19.8 Using Cluster Membership to Predict Churn The R Zone R References Exercises
CHAPTER 20 KOHONEN NETWORKS 542 20.1 Self-Organizing Maps 542 20.2 Kohonen Networks 544 20.3 Example of a Kohonen Network Study 545 20.4 Cluster Validity 549 20.5 Application of Clustering Using Kohonen Networks 549 20.6 Interpreting The Clusters 551 20.6.1 Cluster Profiles 554 20.7 Using Cluster Membership as Input to Downstream Data Mining Models 556 The R Zone 557 R References 558 Exercises 558 CHAPTER 21 BIRCH CLUSTERING 560 21.1 Rationale for BIRCH Clustering 560 21.2 Cluster Features 561 21.3 Cluster Feature Tree 562 21.4 Phase 1: Building the CF Tree 562 21.5 Phase 2: Clustering the Sub-Clusters 564 21.6 Example of BIRCH Clustering, Phase 1: Building the CF Tree 565 21.7 Example of BIRCH Clustering, Phase 2: Clustering the Sub-Clusters 570 21.8 Evaluating the Candidate Cluster Solutions 571 21.9 Case Study: Applying BIRCH Clustering to the Bank Loans Data Set 571 21.9.1 Case Study Lesson One: Avoid Highly Correlated Inputs to Any Clustering Algorithm 572 21.9.2 Case Study Lesson Two: Different Sortings May Lead to Different Numbers of Clusters 577 The R Zone 579 R References 580 Exercises 580 CHAPTER 22 MEASURING CLUSTER GOODNESS 582 22.1 Rationale for Measuring Cluster Goodness 582 22.2 The Silhouette Method 583 22.3 Silhouette Example 584 22.4 Silhouette Analysis of the IRIS Data Set 585 22.5 The Pseudo-F Statistic 590 22.6 Example of the Pseudo-F Statistic 591 22.7 Pseudo-F Statistic Applied to the IRIS Data Set 592 22.8 Cluster Validation 593 22.9 Cluster Validation Applied to the Loans Data Set 594 The R Zone 597 R References 599 Exercises 599
PART V ASSOCIATION RULES 601 CHAPTER 23 ASSOCIATION RULES 603 23.1 Affinity Analysis and Market Basket Analysis 603 23.1.1 Data Representation for Market Basket Analysis 604 23.2 Support, Confidence, Frequent Itemsets, and the a Priori Property 605 23.3 How Does the a Priori Algorithm Work (Part 1)? Generating Frequent Itemsets 607 23.4 How Does the a Priori Algorithm Work (Part 2)? Generating Association Rules 608 23.5 Extension from Flag Data to General Categorical Data 611 23.6 Information-Theoretic Approach: Generalized Rule Induction Method 612 23.6.1 J-Measure 613 23.7 Association Rules are Easy to do Badly 614 23.8 How can we Measure the Usefulness of Association Rules? 615 23.9 Do Association Rules Represent Supervised or Unsupervised Learning? 616 23.10 Local Patterns Versus Global Models 617 The R Zone 618 R References 618 Exercises 619 PART VI ENHANCING MODEL PERFORMANCE 623 CHAPTER 24 SEGMENTATION MODELS 625 24.1 The Segmentation Modeling Process 625 24.2 Segmentation Modeling Using EDA to Identify the Segments 627 24.3 Segmentation Modeling using Clustering to Identify the Segments 629 The R Zone 634 R References 635 Exercises 635 CHAPTER 25 ENSEMBLE METHODS: BAGGING AND BOOSTING 637 25.1 Rationale for Using an Ensemble of Classification Models 637 25.2 Bias, Variance, and Noise 639 25.3 When to Apply, and not to apply, Bagging 640 25.4 Bagging 641 25.5 Boosting 643 25.6 Application of Bagging and Boosting Using IBM/SPSS Modeler 647 References 648 The R Zone 649 R Reference 650 Exercises 650
CHAPTER 26 MODEL VOTING AND PROPENSITY AVERAGING 653 26.1 Simple Model Voting 653 26.2 Alternative Voting Methods 654 26.3 Model Voting Process 655 26.4 An Application of Model Voting 656 26.5 What is Propensity Averaging? 660 26.6 Propensity Averaging Process 661 26.7 An Application of Propensity Averaging 661 The R Zone 665 R References 666 Exercises 666 PART VII FURTHER TOPICS 669 CHAPTER 27 GENETIC ALGORITHMS 671 27.1 Introduction To Genetic Algorithms 671 27.2 Basic Framework of a Genetic Algorithm 672 27.3 Simple Example of a Genetic Algorithm at Work 673 27.3.1 First Iteration 674 27.3.2 Second Iteration 675 27.4 Modifications and Enhancements: Selection 676 27.5 Modifications and Enhancements: Crossover 678 27.5.1 Multi-Point Crossover 678 27.5.2 Uniform Crossover 678 27.6 Genetic Algorithms for Real-Valued Variables 679 27.6.1 Single Arithmetic Crossover 680 27.6.2 Simple Arithmetic Crossover 680 27.6.3 Whole Arithmetic Crossover 680 27.6.4 Discrete Crossover 681 27.6.5 Normally Distributed Mutation 681 27.7 Using Genetic Algorithms to Train a Neural Network 681 27.8 WEKA: Hands-On Analysis Using Genetic Algorithms 684 The R Zone 692 R References 693 Exercises 693 CHAPTER 28 IMPUTATION OF MISSING DATA 695 28.1 Need for Imputation of Missing Data 695 28.2 Imputation of Missing Data: Continuous Variables 696 28.3 Standard Error of the Imputation 699 28.4 Imputation of Missing Data: Categorical Variables 700 28.5 Handling Patterns in Missingness 701 Reference 701 The R Zone 702
R References 704 Exercises 704 PART VIII CASE STUDY: PREDICTING RESPONSE TO DIRECT-MAIL MARKETING 705 CHAPTER 29 CASE STUDY, PART 1: BUSINESS UNDERSTANDING, DATA PREPARATION, AND EDA 707 29.1 Cross-Industry Standard Practice for Data Mining 707 29.2 Business Understanding Phase 709 29.3 Data Understanding Phase, Part 1: Getting a Feel for the Data Set 710 29.4 Data Preparation Phase 714 29.4.1 Negative Amounts Spent? 714 29.4.2 Transformations to Achieve Normality or Symmetry 716 29.4.3 Standardization 717 29.4.4 Deriving New Variables 719 29.5 Data Understanding Phase, Part 2: Exploratory Data Analysis 721 29.5.1 Exploring the Relationships between the Predictors and the Response 722 29.5.2 Investigating the Correlation Structure among the Predictors 727 29.5.3 Importance of De-Transforming for Interpretation 730 CHAPTER 30 CASE STUDY, PART 2: CLUSTERING AND PRINCIPAL COMPONENTS ANALYSIS 732 30.1 Partitioning the Data 732 30.1.1 Validating the Partition 732 30.2 Developing the Principal Components 733 30.3 Validating the Principal Components 737 30.4 Profiling the Principal Components 737 30.5 Choosing the Optimal Number of Clusters Using BIRCH Clustering 742 30.6 Choosing the Optimal Number of Clusters Using k-Means Clustering 744 30.7 Application of k-Means Clustering 745 30.8 Validating the Clusters 745 30.9 Profiling the Clusters 745 CHAPTER 31 CASE STUDY, PART 3: MODELING AND EVALUATION FOR PERFORMANCE AND INTERPRETABILITY 749 31.1 Do you Prefer the Best Model Performance, or a Combination of Performance and Interpretability? 749 31.2 Modeling and Evaluation Overview 750 31.3 Cost-Benefit Analysis Using Data-Driven Costs 751 31.3.1 Calculating Direct Costs 752 31.4 Variables to be Input to the Models 753
31.5 Establishing the Baseline Model Performance 754 31.6 Models that use Misclassification Costs 755 31.7 Models that Need Rebalancing as a Surrogate for Misclassification Costs 756 31.8 Combining Models Using Voting and Propensity Averaging 757 31.9 Interpreting the Most Profitable Model 758 CHAPTER 32 CASE STUDY, PART 4: MODELING AND EVALUATION FOR HIGH PERFORMANCE ONLY 762 32.1 Variables to be Input to the Models 762 32.2 Models that use Misclassification Costs 762 32.3 Models that Need Rebalancing as a Surrogate for Misclassification Costs 764 32.4 Combining Models using Voting and Propensity Averaging 765 32.5 Lessons Learned 766 32.6 Conclusions 766 APPENDIX A DATA SUMMARIZATION AND VISUALIZATION 768 Part 1: Summarization 1: Building Blocks of Data Analysis 768 Part 2: Visualization: Graphs and Tables for Summarizing and Organizing Data 770 Part 3: Summarization 2: Measures of Center, Variability, and Position 774 Part 4: Summarization and Visualization of Bivariate Relationships 777 INDEX 781
|
any_adam_object | 1 |
author | Larose, Daniel T. Larose, Chantal D. |
author_GND | (DE-588)1062529189 (DE-588)1062529227 |
author_facet | Larose, Daniel T. Larose, Chantal D. |
author_role | aut aut |
author_sort | Larose, Daniel T. |
author_variant | d t l dt dtl c d l cd cdl |
building | Verbundindex |
bvnumber | BV042017500 |
classification_rvk | ST 530 |
classification_tum | DAT 703f |
ctrlnum | (OCoLC)908652508 (DE-599)BSZ407445226 |
dewey-full | 006.3/12 |
dewey-hundreds | 000 - Computer science, information, general works |
dewey-ones | 006 - Special computer methods |
dewey-raw | 006.3/12 |
dewey-search | 006.3/12 |
dewey-sort | 16.3 212 |
dewey-tens | 000 - Computer science, information, general works |
discipline | Informatik |
edition | 2. ed. |
format | Book |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>01715nam a22004092c 4500</leader><controlfield tag="001">BV042017500</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20190417 </controlfield><controlfield tag="007">t</controlfield><controlfield tag="008">140808s2015 ad|| |||| 00||| eng d</controlfield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9781118116197</subfield><subfield code="c">hbk.</subfield><subfield code="9">978-1-118-11619-7</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)908652508</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)BSZ407445226</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-91G</subfield><subfield code="a">DE-1050</subfield><subfield code="a">DE-1049</subfield><subfield code="a">DE-29T</subfield><subfield code="a">DE-384</subfield><subfield code="a">DE-91</subfield><subfield code="a">DE-862</subfield><subfield code="a">DE-M382</subfield><subfield code="a">DE-859</subfield><subfield code="a">DE-706</subfield><subfield code="a">DE-945</subfield><subfield code="a">DE-573</subfield><subfield code="a">DE-11</subfield><subfield code="a">DE-Aug4</subfield><subfield code="a">DE-739</subfield></datafield><datafield tag="082" ind1="0" ind2=" "><subfield code="a">006.3/12</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 530</subfield><subfield code="0">(DE-625)143679:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">DAT 703f</subfield><subfield code="2">stub</subfield></datafield><datafield tag="100" ind1="1" ind2=" 
"><subfield code="a">Larose, Daniel T.</subfield><subfield code="e">Verfasser</subfield><subfield code="0">(DE-588)1062529189</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Data mining and predictive analytics</subfield><subfield code="c">Daniel T. Larose ; Chantal D. Larose</subfield></datafield><datafield tag="250" ind1=" " ind2=" "><subfield code="a">2. ed.</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Hoboken, NJ</subfield><subfield code="b">Wiley</subfield><subfield code="c">2015</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">XXIX, 794 S.</subfield><subfield code="b">Ill., graph. Darst.</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="490" ind1="0" ind2=" "><subfield code="a">Wiley series on methods and applications in data mining</subfield></datafield><datafield tag="500" ind1=" " ind2=" "><subfield code="a">Auch angekündigt u.d.T.: Data mining methods and models</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Data Mining</subfield><subfield code="0">(DE-588)4428654-5</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Vorhersagetheorie</subfield><subfield code="0">(DE-588)4188671-9</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">Data Mining</subfield><subfield code="0">(DE-588)4428654-5</subfield><subfield code="D">s</subfield></datafield><datafield 
tag="689" ind1="0" ind2="1"><subfield code="a">Vorhersagetheorie</subfield><subfield code="0">(DE-588)4188671-9</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" "><subfield code="5">DE-604</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Larose, Chantal D.</subfield><subfield code="e">Verfasser</subfield><subfield code="0">(DE-588)1062529227</subfield><subfield code="4">aut</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">HBZ Datenaustausch</subfield><subfield code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=027459262&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="999" ind1=" " ind2=" "><subfield code="a">oai:aleph.bib-bvb.de:BVB01-027459262</subfield></datafield></record></collection> |
id | DE-604.BV042017500 |
illustrated | Illustrated |
indexdate | 2024-11-28T04:01:05Z |
institution | BVB |
isbn | 9781118116197 |
language | English |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-027459262 |
oclc_num | 908652508 |
open_access_boolean | |
owner | DE-91G DE-BY-TUM DE-1050 DE-1049 DE-29T DE-384 DE-91 DE-BY-TUM DE-862 DE-BY-FWS DE-M382 DE-859 DE-706 DE-945 DE-573 DE-11 DE-Aug4 DE-739 |
owner_facet | DE-91G DE-BY-TUM DE-1050 DE-1049 DE-29T DE-384 DE-91 DE-BY-TUM DE-862 DE-BY-FWS DE-M382 DE-859 DE-706 DE-945 DE-573 DE-11 DE-Aug4 DE-739 |
physical | XXIX, 794 S. Ill., graph. Darst. |
publishDate | 2015 |
publishDateSearch | 2015 |
publishDateSort | 2015 |
publisher | Wiley |
record_format | marc |
series2 | Wiley series on methods and applications in data mining |
spellingShingle | Larose, Daniel T. Larose, Chantal D. Data mining and predictive analytics Data Mining (DE-588)4428654-5 gnd Vorhersagetheorie (DE-588)4188671-9 gnd |
subject_GND | (DE-588)4428654-5 (DE-588)4188671-9 |
title | Data mining and predictive analytics |
title_auth | Data mining and predictive analytics |
title_exact_search | Data mining and predictive analytics |
title_full | Data mining and predictive analytics Daniel T. Larose ; Chantal D. Larose |
title_fullStr | Data mining and predictive analytics Daniel T. Larose ; Chantal D. Larose |
title_full_unstemmed | Data mining and predictive analytics Daniel T. Larose ; Chantal D. Larose |
title_short | Data mining and predictive analytics |
title_sort | data mining and predictive analytics |
topic | Data Mining (DE-588)4428654-5 gnd Vorhersagetheorie (DE-588)4188671-9 gnd |
topic_facet | Data Mining Vorhersagetheorie |
url | http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=027459262&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
work_keys_str_mv | AT larosedanielt dataminingandpredictiveanalytics AT larosechantald dataminingandpredictiveanalytics |
Table of contents
Special location: Faculty
Call number: |
2000 ST 530 L331(2) |
---|---|
Copy 1 | not loanable; checked out – due back: 31.12.2099; place hold |