Data mining with R: learning with case studies
Main Author: | Torgo, Luís |
---|---|
Format: | Book |
Language: | English |
Published: | Boca Raton: CRC Press, [2017] |
Edition: | Second edition |
Series: | Data mining and knowledge discovery series |
Subjects: | Data mining; R (Computer program language); Case studies |
Online Access: | Table of contents |
Description: | Includes index |
Physical Description: | xix, 405 pages, illustrations, diagrams |
ISBN: | 9781482234893 9781315399102 |
Internal format
MARC
LEADER | 00000nam a2200000 c 4500 | ||
---|---|---|---|
001 | BV043957796 | ||
003 | DE-604 | ||
005 | 20180308 | ||
007 | t | ||
008 | 161213s2017 xxua||| |||| 00||| eng d | ||
010 | |a 016024995 | ||
020 | |a 9781482234893 |c hardback |9 978-1-4822-3489-3 | ||
020 | |a 9781315399102 |9 978-1-315-39910-2 | ||
035 | |a (OCoLC)970352491 | ||
035 | |a (DE-599)BVBBV043957796 | ||
040 | |a DE-604 |b ger |e rda | ||
041 | 0 | |a eng | |
044 | |a xxu |c US | ||
049 | |a DE-739 |a DE-355 |a DE-824 |a DE-384 |a DE-19 |a DE-20 |a DE-11 | ||
050 | 0 | |a QA76.9.D343 | |
082 | 0 | |a 006.3/12 |2 23 | |
084 | |a CM 4000 |0 (DE-625)18951: |2 rvk | ||
084 | |a ST 250 |0 (DE-625)143626: |2 rvk | ||
084 | |a ST 530 |0 (DE-625)143679: |2 rvk | ||
084 | |a ST 601 |0 (DE-625)143682: |2 rvk | ||
084 | |a WC 7700 |0 (DE-625)148144: |2 rvk | ||
100 | 1 | |a Torgo, Luís |e Verfasser |4 aut | |
245 | 1 | 0 | |a Data mining with R |b learning with case studies |c Luís Torgo, University of Porto, Portugal |
250 | |a Second edition | ||
264 | 1 | |a Boca Raton |b CRC Press |c [2017] | |
300 | |a xix, 405 Seiten |b Illustrationen, Diagramme | ||
336 | |b txt |2 rdacontent | ||
337 | |b n |2 rdamedia | ||
338 | |b nc |2 rdacarrier | ||
490 | 0 | |a Data mining and knowledge discovery series | |
500 | |a Includes index | ||
650 | 4 | |a Data mining |v Case studies | |
650 | 4 | |a R (Computer program language) | |
650 | 0 | 7 | |a Data Mining |0 (DE-588)4428654-5 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a R |g Programm |0 (DE-588)4705956-4 |2 gnd |9 rswk-swf |
655 | 7 | |0 (DE-588)4522595-3 |a Fallstudiensammlung |2 gnd-content | |
689 | 0 | 0 | |a Data Mining |0 (DE-588)4428654-5 |D s |
689 | 0 | 1 | |a R |g Programm |0 (DE-588)4705956-4 |D s |
689 | 0 | |C b |5 DE-604 | |
856 | 4 | 2 | |m Digitalisierung UB Regensburg - ADAM Catalogue Enrichment |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=029366500&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis |
999 | |a oai:aleph.bib-bvb.de:BVB01-029366500 |
Record in the search index
_version_ | 1804176911961686016 |
---|---|
adam_text | Contents
Preface xi
Acknowledgments xiii
List of Figures xv
List of Tables xix
1 Introduction 1
1.1 How to Read This Book................................................. 2
1.2 Reproducibility ........................................................ 3
I R and Data Mining 5
2 Introduction to R 7
2.1 Starting with R ........................................................ 7
2.2 Basic Interaction with the R Console ................................... 9
2.3 R Objects and Variables................................................ 10
2.4 R Functions ........................................................ 12
2.5 Vectors ............................................................... 16
2.6 Vectorization.......................................................... 18
2.7 Factors ............................................................... 19
2.8 Generating Sequences.................................................. 22
2.9 Sub-Setting.......................................................... 24
2.10 Matrices and Arrays.................................................. 26
2.11 Lists................................................................. 30
2.12 Data Frames .......................................................... 32
2.13 Useful Extensions to Data Frames .................................. 36
2.14 Objects, Classes, and Methods......................................... 40
2.15 Managing Your Sessions................................................ 41
3 Introduction to Data Mining 43
3.1 A Bird’s Eye View on Data Mining ...................................... 43
3.2 Data Collection and Business Understanding............................ 45
3.2.1 Data and Datasets............................................... 45
3.2.2 Importing Data into R .......................................... 46
3.2.2.1 Text Files.............................................. 47
3.2.2.2 Databases............................................... 49
3.2.2.3 Spreadsheets............................................ 52
3.2.2.4 Other Formats............................................. 52
3.3 Data Pre-Processing...................................................... 53
3.3.1 Data Cleaning.................................................... 53
3.3.1.1 Tidy Data................................................. 53
3.3.1.2 Handling Dates .......................................... 56
3.3.1.3 String Processing ........................................ 58
3.3.1.4 Dealing with Unknown Values............................... 60
3.3.2 Transforming Variables ........................................... 62
3.3.2.1 Handling Different Scales of Variables.................... 62
3.3.2.2 Discretizing Variables .................................. 63
3.3.3 Creating Variables............................................... 65
3.3.3.1 Handling Case Dependencies................................ 65
3.3.3.2 Handling Text Datasets.................................... 74
3.3.4 Dimensionality Reduction.......................................... 78
3.3.4.1 Sampling Rows............................................. 78
3.3.4.2 Variable Selection........................................ 82
3.4 Modeling ............................................................ 87
3.4.1 Exploratory Data Analysis ....................................... 87
3.4.1.1 Data Summarization...................................... 87
3.4.1.2 Data Visualization........................................ 96
3.4.2 Dependency Modeling using Association Rules ..................... 110
3.4.3 Clustering....................................................... 119
3.4.3.1 Measures of Dissimilarity................................ 119
3.4.3.2 Clustering Methods....................................... 120
3.4.4 Anomaly Detection................................................ 131
3.4.4.1 Univariate Outlier Detection Methods..................... 132
3.4.4.2 Multi-Variate Outlier Detection Methods.................. 133
3.4.5 Predictive Analytics............................................. 140
3.4.5.1 Evaluation Metrics....................................... 141
3.4.5.2 Tree-Based Models........................................ 145
3.4.5.3 Support Vector Machines.................................. 151
3.4.5.4 Artificial Neural Networks and Deep Learning ....... 158
3.4.5.5 Model Ensembles ......................................... 165
3.5 Evaluation ............................................................. 172
3.5.1 The Holdout and Random Subsampling.............................. 174
3.5.2 Cross Validation................................................ 177
3.5.3 Bootstrap Estimates............................................. 179
3.5.4 Recommended Procedures........................................... 181
3.6 Reporting and Deployment................................................ 182
3.6.1 Reporting Through Dynamic Documents.............................. 183
3.6.2 Deployment through Web Applications.............................. 186
II Case Studies 191
4 Predicting Algae Blooms 193
4.1 Problem Description and Objectives...................................... 193
4.2 Data Description........................................................ 194
4.3 Loading the Data into R ............................................... 194
4.4 Data Visualization and Summarization ................................... 196
4.5 Unknown Values......................................................... 205
4.5.1 Removing the Observations with Unknown Values..................... 205
4.5.2 Filling in the Unknowns with the Most Frequent Values............. 207
4.5.3 Filling in the Unknown Values by Exploring Correlations........... 208
4.5.4 Filling in the Unknown Values by Exploring Similarities between Cases... 212
4.6 Obtaining Prediction Models ........................................... 214
4.6.1 Multiple Linear Regression........................................ 215
4.6.2 Regression Trees.................................................. 220
4.7 Model Evaluation and Selection ..................................... 225
4.8 Predictions for the Seven Algae ..................................... 237
4.9 Summary............................................................... 239
5 Predicting Stock Market Returns 241
5.1 Problem Description and Objectives..................................... 241
5.2 The Available Data .................................................... 242
5.2.1 Reading the Data from the CSV File............................... 243
5.2.2 Getting the Data from the Web..................................... 243
5.3 Defining the Prediction Tasks .......................................... 244
5.3.1 What to Predict?................................................ 244
5.3.2 Which Predictors?................................................. 247
5.3.3 The Prediction Tasks ........................................... 251
5.3.4 Evaluation Criteria ............................................. 252
5.4 The Prediction Models .................................................. 254
5.4.1 How Will the Training Data Be Used? .............................. 254
5.4.2 The Modeling Tools................................................ 256
5.4.2.1 Artificial Neural Networks .............................. 256
5.4.2.2 Support Vector Machines.................................. 259
5.4.2.3 Multivariate Adaptive Regression Splines................ 260
5.5 From Predictions into Actions .......................................... 263
5.5.1 How Will the Predictions Be Used?................................. 263
5.5.2 Trading-Related Evaluation Criteria............................... 264
5.5.3 Putting Everything Together: A Simulated Trader................... 265
5.6 Model Evaluation and Selection ........................................ 271
5.6.1 Monte Carlo Estimates ............................................ 271
5.6.2 Experimental Comparisons.......................................... 272
5.6.3 Results Analysis.................................................. 278
5.7 The Trading System..................................................... 286
5.7.1 Evaluation of the Final Test Data................................. 286
5.7.2 An Online Trading System.......................................... 291
5.8 Summary................................................................ 292
6 Detecting Fraudulent Transactions 295
6.1 Problem Description and Objectives...................................... 295
6.2 The Available Data ..................................................... 296
6.2.1 Loading the Data into R ........................................ 296
6.2.2 Exploring the Dataset............................................. 297
6.2.3 Data Problems..................................................... 304
6.2.3.1 Unknown Values........................................... 304
6.2.3.2 Few Transactions of Some Products....................... 309
6.3 Defining the Data Mining Tasks ......................................... 313
6.3.1 Different Approaches to the Problem ............................. 313
6.3.1.1 Unsupervised Techniques................................. 313
6.3.1.2 Supervised Techniques.................................. 314
6.3.1.3 Semi-Supervised Techniques ............................. 315
6.3.2 Evaluation Criteria.............................................. 316
6.3.2.1 Precision and Recall.................................... 316
6.3.2.2 Lift Charts and Precision/Recall Curves................. 317
6.3.2.3 Normalized Distance to Typical Price................... 320
6.3.3 Experimental Methodology......................................... 321
6.4 Obtaining Outlier Rankings ............................................. 323
6.4.1 Unsupervised Approaches.......................................... 323
6.4.1.1 The Modified Box Plot Rule.............................. 323
6.4.1.2 Local Outlier Factors (LOF)............................. 327
6.4.1.3 Clustering-Based Outlier Rankings (ORh)................. 330
6.4.2 Supervised Approaches .......................................... 332
6.4.2.1 The Class Imbalance Problem............................. 333
6.4.2.2 Naive Bayes ........................................... 335
6.4.2.3 AdaBoost................................................ 339
6.4.3 Semi-Supervised Approaches....................................... 344
6.5 Summary................................................................ 350
7 Classifying Microarray Samples 353
7.1 Problem Description and Objectives...................................... 353
7.1.1 Brief Background on Microarray Experiments....................... 353
7.1.2 The ALL Dataset.................................................. 354
7.2 The Available Data ..................................................... 354
7.2.1 Exploring the Dataset........................................... 357
7.3 Gene (Feature) Selection ............................................... 359
7.3.1 Simple Filters Based on Distribution Properties.................. 360
7.3.2 ANOVA Filters................................................... 362
7.3.3 Filtering Using Random Forests .................................. 364
7.3.4 Filtering Using Feature Clustering Ensembles..................... 367
7.4 Predicting Cytogenetic Abnormalities.................................... 368
7.4.1 Defining the Prediction Task..................................... 368
7.4.2 The Evaluation Metric............................................ 369
7.4.3 The Experimental Procedure....................................... 369
7.4.4 The Modeling Techniques.......................................... 370
7.4.5 Comparing the Models............................................. 373
7.5 Summary............................................................... 381
Bibliography 383
Subject Index 395
Index of Data Mining Topics 399
Index of R Functions 401
|
any_adam_object | 1 |
author | Torgo, Luís |
author_facet | Torgo, Luís |
author_role | aut |
author_sort | Torgo, Luís |
author_variant | l t lt |
building | Verbundindex |
bvnumber | BV043957796 |
callnumber-first | Q - Science |
callnumber-label | QA76 |
callnumber-raw | QA76.9.D343 |
callnumber-search | QA76.9.D343 |
callnumber-sort | QA 276.9 D343 |
callnumber-subject | QA - Mathematics |
classification_rvk | CM 4000 ST 250 ST 530 ST 601 WC 7700 |
ctrlnum | (OCoLC)970352491 (DE-599)BVBBV043957796 |
dewey-full | 006.3/12 |
dewey-hundreds | 000 - Computer science, information, general works |
dewey-ones | 006 - Special computer methods |
dewey-raw | 006.3/12 |
dewey-search | 006.3/12 |
dewey-sort | 16.3 212 |
dewey-tens | 000 - Computer science, information, general works |
discipline | Biologie Informatik Psychologie |
edition | Second edition |
format | Book |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>02011nam a2200517 c 4500</leader><controlfield tag="001">BV043957796</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20180308 </controlfield><controlfield tag="007">t</controlfield><controlfield tag="008">161213s2017 xxua||| |||| 00||| eng d</controlfield><datafield tag="010" ind1=" " ind2=" "><subfield code="a">016024995</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9781482234893</subfield><subfield code="c">hardback</subfield><subfield code="9">978-1-4822-3489-3</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9781315399102</subfield><subfield code="9">978-1-315-39910-2</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)970352491</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)BVBBV043957796</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield><subfield code="e">rda</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="044" ind1=" " ind2=" "><subfield code="a">xxu</subfield><subfield code="c">US</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-739</subfield><subfield code="a">DE-355</subfield><subfield code="a">DE-824</subfield><subfield code="a">DE-384</subfield><subfield code="a">DE-19</subfield><subfield code="a">DE-20</subfield><subfield code="a">DE-11</subfield></datafield><datafield tag="050" ind1=" " ind2="0"><subfield code="a">QA76.9.D343</subfield></datafield><datafield tag="082" ind1="0" ind2=" "><subfield code="a">006.3/12</subfield><subfield code="2">23</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">CM 4000</subfield><subfield 
code="0">(DE-625)18951:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 250</subfield><subfield code="0">(DE-625)143626:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 530</subfield><subfield code="0">(DE-625)143679:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 601</subfield><subfield code="0">(DE-625)143682:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">WC 7700</subfield><subfield code="0">(DE-625)148144:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Torgo, Luís</subfield><subfield code="e">Verfasser</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Data mining with R</subfield><subfield code="b">learning with case studies</subfield><subfield code="c">Luís Torgo, University of Porto, Portugal</subfield></datafield><datafield tag="250" ind1=" " ind2=" "><subfield code="a">Second edition</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Boca Raton</subfield><subfield code="b">CRC Press</subfield><subfield code="c">[2017]</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">xix, 405 Seiten</subfield><subfield code="b">Illustrationen, Diagramme</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="490" ind1="0" ind2=" "><subfield code="a">Data mining and knowledge 
discovery series</subfield></datafield><datafield tag="500" ind1=" " ind2=" "><subfield code="a">Includes index</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Data mining</subfield><subfield code="v">Case studies</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">R (Computer program language)</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Data Mining</subfield><subfield code="0">(DE-588)4428654-5</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">R</subfield><subfield code="g">Programm</subfield><subfield code="0">(DE-588)4705956-4</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="655" ind1=" " ind2="7"><subfield code="0">(DE-588)4522595-3</subfield><subfield code="a">Fallstudiensammlung</subfield><subfield code="2">gnd-content</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">Data Mining</subfield><subfield code="0">(DE-588)4428654-5</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="1"><subfield code="a">R</subfield><subfield code="g">Programm</subfield><subfield code="0">(DE-588)4705956-4</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" "><subfield code="C">b</subfield><subfield code="5">DE-604</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">Digitalisierung UB Regensburg - ADAM Catalogue Enrichment</subfield><subfield code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=029366500&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="999" ind1=" " ind2=" "><subfield 
code="a">oai:aleph.bib-bvb.de:BVB01-029366500</subfield></datafield></record></collection> |
genre | (DE-588)4522595-3 Fallstudiensammlung gnd-content |
genre_facet | Fallstudiensammlung |
id | DE-604.BV043957796 |
illustrated | Illustrated |
indexdate | 2024-07-10T07:39:43Z |
institution | BVB |
isbn | 9781482234893 9781315399102 |
language | English |
lccn | 016024995 |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-029366500 |
oclc_num | 970352491 |
open_access_boolean | |
owner | DE-739 DE-355 DE-BY-UBR DE-824 DE-384 DE-19 DE-BY-UBM DE-20 DE-11 |
owner_facet | DE-739 DE-355 DE-BY-UBR DE-824 DE-384 DE-19 DE-BY-UBM DE-20 DE-11 |
physical | xix, 405 Seiten Illustrationen, Diagramme |
publishDate | 2017 |
publishDateSearch | 2017 |
publishDateSort | 2017 |
publisher | CRC Press |
record_format | marc |
series2 | Data mining and knowledge discovery series |
spelling | Torgo, Luís Verfasser aut Data mining with R learning with case studies Luís Torgo, University of Porto, Portugal Second edition Boca Raton CRC Press [2017] xix, 405 Seiten Illustrationen, Diagramme txt rdacontent n rdamedia nc rdacarrier Data mining and knowledge discovery series Includes index Data mining Case studies R (Computer program language) Data Mining (DE-588)4428654-5 gnd rswk-swf R Programm (DE-588)4705956-4 gnd rswk-swf (DE-588)4522595-3 Fallstudiensammlung gnd-content Data Mining (DE-588)4428654-5 s R Programm (DE-588)4705956-4 s b DE-604 Digitalisierung UB Regensburg - ADAM Catalogue Enrichment application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=029366500&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA Inhaltsverzeichnis |
spellingShingle | Torgo, Luís Data mining with R learning with case studies Data mining Case studies R (Computer program language) Data Mining (DE-588)4428654-5 gnd R Programm (DE-588)4705956-4 gnd |
subject_GND | (DE-588)4428654-5 (DE-588)4705956-4 (DE-588)4522595-3 |
title | Data mining with R learning with case studies |
title_auth | Data mining with R learning with case studies |
title_exact_search | Data mining with R learning with case studies |
title_full | Data mining with R learning with case studies Luís Torgo, University of Porto, Portugal |
title_fullStr | Data mining with R learning with case studies Luís Torgo, University of Porto, Portugal |
title_full_unstemmed | Data mining with R learning with case studies Luís Torgo, University of Porto, Portugal |
title_short | Data mining with R |
title_sort | data mining with r learning with case studies |
title_sub | learning with case studies |
topic | Data mining Case studies R (Computer program language) Data Mining (DE-588)4428654-5 gnd R Programm (DE-588)4705956-4 gnd |
topic_facet | Data mining Case studies R (Computer program language) Data Mining R Programm Fallstudiensammlung |
url | http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=029366500&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
work_keys_str_mv | AT torgoluis dataminingwithrlearningwithcasestudies |