Data science bookcamp: five Python projects
Main author: | Apeltsin, Leonard |
---|---|
Format: | Book |
Language: | English |
Published: | Shelter Island: Manning, [2021] |
Subjects: | Data Science; Data Mining; Python (programming language) |
Online access: | Table of contents (aggregator); Table of contents (UB Passau digitization) |
Summary: | 1. Computing probabilities using Python -- 2. Plotting probabilities using Matplotlib -- 3. Running random simulations in NumPy -- 4. Case study 1 solution -- 5. Basic probability and statistical analysis using SciPy -- 6. Making predictions using the central limit theorem and SciPy -- 7. Statistical hypothesis testing -- 8. Analyzing tables using Pandas -- 9. Case study 2 solution -- 10. Clustering data into groups -- 11. Geographic location visualization and analysis -- 12. Case study 3 solution -- 13. Measuring text similarities -- 14. Dimension reduction of matrix data -- 15. NLP analysis of large text datasets -- 16. Extracting text from web pages -- 17. Case study 4 solution -- 18. An introduction to graph theory and network analysis -- 19. Dynamic graph theory techniques for node ranking and social network analysis -- 20. Network-driven supervised machine learning -- 21. Training linear classifiers with logistic regression -- 22. Training nonlinear classifiers with decision tree techniques -- 23. Case study 5 solution. |
Item description: | Subtitle on cover: Five real-world Python projects. Includes bibliographical references. |
Physical description: | xxvi, 676 pages: diagrams, illustrations; 24 cm |
ISBN: | 9781617296253 |
Internal format
MARC
LEADER | 00000nam a2200000 c 4500 | ||
---|---|---|---|
001 | BV048630066 | ||
003 | DE-604 | ||
005 | 20230213 | ||
007 | t | ||
008 | 230104s2021 xxua||| |||| 00||| eng d | ||
020 | |a 9781617296253 |c softcover |9 978-1-61729-625-3 | ||
035 | |a (OCoLC)1294301294 | ||
035 | |a (DE-599)KXP178714853X | ||
040 | |a DE-604 |b ger |e rda | ||
041 | 0 | |a eng | |
044 | |a xxu |c XD-US | ||
049 | |a DE-739 | ||
082 | 0 | |a 006.312 | |
084 | |a ST 300 |0 (DE-625)143650: |2 rvk | ||
100 | 1 | |a Apeltsin, Leonard |d ca. 20./21. Jh. |e Verfasser |0 (DE-588)1280840129 |4 aut | |
245 | 1 | 0 | |a Data science bookcamp |b five Python projects |c Leonard Apeltsin |
264 | 1 | |a Shelter Island |b Manning |c [2021] | |
300 | |a xxvi, 676 Seiten |b Diagramme, Illustrationen |c 24 cm | ||
336 | |b txt |2 rdacontent | ||
337 | |b n |2 rdamedia | ||
338 | |b nc |2 rdacarrier | ||
500 | |a Untertitel auf Cover: Five real-world Python projects | ||
500 | |a Literaturangaben | ||
520 | 3 | |a 1. Computing probabilities using Python -- 2. Plotting probabilities using Matplotlib -- 3. Running random simulations in NumPy -- 4. Case study 1 solution -- 5. Basic probability and statistical analysis using SciPy -- 6. Making predictions using the central limit theorem and SciPy -- 7. Statistical hypothesis testing -- 8. Analyzing tables using Pandas -- 9. Case study 2 solution -- 10. Clustering data into groups -- 11. Geographic location visualization and analysis -- 12. Case study 3 solution -- 13. Measuring text similarities -- 14. Dimension reduction of matrix data -- 15. NLP analysis of large text datasets -- 16. Extracting text from web pages -- 17. Case study 4 solution -- 18. An introduction to graph theory and network analysis -- 19. Dynamic graph theory techniques for node ranking and social network analysis -- 20. Network-driven supervised machine learning -- 21. Training linear classifiers with logistic regression -- 22. Training nonlinear classifiers with decision tree techniques -- 23. Case study 5 solution. | |
650 | 0 | 7 | |a Data Science |0 (DE-588)1140936166 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Data Mining |0 (DE-588)4428654-5 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Python |g Programmiersprache |0 (DE-588)4434275-5 |2 gnd |9 rswk-swf |
653 | 0 | |a Data mining | |
653 | 0 | |a Data sets | |
653 | 0 | |a Python (Computer program language) | |
689 | 0 | 0 | |a Data Science |0 (DE-588)1140936166 |D s |
689 | 0 | 1 | |a Data Mining |0 (DE-588)4428654-5 |D s |
689 | 0 | 2 | |a Python |g Programmiersprache |0 (DE-588)4434275-5 |D s |
689 | 0 | |5 DE-604 | |
856 | 4 | 2 | |u https://www.gbv.de/dms/bowker/toc/9781617296253.pdf |v 2022-07-28 |x Aggregator |3 Inhaltsverzeichnis |
856 | 4 | 2 | |m Digitalisierung UB Passau - ADAM Catalogue Enrichment |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=034005124&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis |
999 | |a oai:aleph.bib-bvb.de:BVB01-034005124 |
Record in the search index
_version_ | 1804184761679216640 |
---|---|
adam_text | brief contents
Case study 1: Finding the winning strategy in a card game
  1 ■ Computing probabilities using Python
  2 ■ Plotting probabilities using Matplotlib
  3 ■ Running random simulations in NumPy
  4 ■ Case study 1 solution
Case study 2: Assessing online ad clicks for significance
  5 ■ Basic probability and statistical analysis using SciPy
  6 ■ Making predictions using the central limit theorem and SciPy
  7 ■ Statistical hypothesis testing
  8 ■ Analyzing tables using Pandas
  9 ■ Case study 2 solution
Case study 3: Tracking disease outbreaks using news headlines
  10 ■ Clustering data into groups
  11 ■ Geographic location visualization and analysis
  12 ■ Case study 3 solution
Case study 4: Using online job postings to improve your data science resume
  13 ■ Measuring text similarities
  14 ■ Dimension reduction of matrix data
  15 ■ NLP analysis of large text datasets
  16 ■ Extracting text from web pages
  17 ■ Case study 4 solution
Case study 5: Predicting future friendships from social network data
  18 ■ An introduction to graph theory and network analysis
  19 ■ Dynamic graph theory techniques for node ranking and social network analysis
  20 ■ Network-driven supervised machine learning
  21 ■ Training linear classifiers with logistic regression
  22 ■ Training nonlinear classifiers with decision tree techniques
  23 ■ Case study 5 solution
contents
preface
acknowledgments
about this book
about the author
about the cover illustration
Case study 1: Finding the winning strategy in a card game
1 Computing probabilities using Python
  1.1 Sample space analysis: An equation-free approach for measuring uncertainty in outcomes
    Analyzing a biased coin
  1.2 Computing nontrivial probabilities
    Problem 1: Analyzing a family with four children
    Problem 2: Analyzing multiple die rolls
    Problem 3: Computing die-roll probabilities using weighted sample spaces
  1.3 Computing probabilities over interval ranges
    Evaluating extremes using interval analysis
2 Plotting probabilities using Matplotlib
  2.1 Basic Matplotlib plots
  2.2 Plotting coin-flip probabilities
    Comparing multiple coin-flip probability distributions
3 Running random simulations in NumPy
  3.1 Simulating random coin flips and die rolls using NumPy
    Analyzing biased coin flips
  3.2 Computing confidence intervals using histograms and NumPy arrays
    Binning similar points in histogram plots
    Deriving probabilities from histograms
    Shrinking the range of a high confidence interval
    Computing histograms in NumPy
  3.3 Using confidence intervals to analyze a biased deck of cards
  3.4 Using permutations to shuffle cards
4 Case study 1 solution
  4.1 Predicting red cards in a shuffled deck
    Estimating the probability of strategy success
  4.2 Optimizing strategies using the sample space for a 10-card deck
Case study 2: Assessing online ad clicks for significance
  4.3 Problem statement
  4.4 Dataset description
  4.5 Overview
5 Basic probability and statistical analysis using SciPy
  5.1 Exploring the relationships between data and probability using SciPy
  5.2 Mean as a measure of centrality
    Finding the mean of a probability distribution
  5.3 Variance as a measure of dispersion
    Finding the variance of a probability distribution
6 Making predictions using the central limit theorem and SciPy
  6.1 Manipulating the normal distribution using SciPy
    Comparing two sampled normal curves
  6.2 Determining the mean and variance of a population through random sampling
  6.3 Making predictions using the mean and variance
    Computing the area beneath a normal curve
    Interpreting the computed probability
7 Statistical hypothesis testing
  7.1 Assessing the divergence between sample mean and population mean
  7.2 Data dredging: Coming to false conclusions through oversampling
  7.3 Bootstrapping with replacement: Testing a hypothesis when the population variance is unknown
  7.4 Permutation testing: Comparing means of samples when the population parameters are unknown
8 Analyzing tables using Pandas
  8.1 Storing tables using basic Python
  8.2 Exploring tables using Pandas
  8.3 Retrieving table columns
  8.4 Retrieving table rows
  8.5 Modifying table rows and columns
  8.6 Saving and loading table data
  8.7 Visualizing tables using Seaborn
9 Case study 2 solution
  9.1 Processing the ad-click table in Pandas
  9.2 Computing p-values from differences in means
  9.3 Determining statistical significance
  9.4 41 shades of blue: A real-life cautionary tale
Case study 3: Tracking disease outbreaks using news headlines
  9.5 Problem statement
    Dataset description
  9.6 Overview
10 Clustering data into groups
  10.1 Using centrality to discover clusters
  10.2 K-means: A clustering algorithm for grouping data into K central groups
    K-means clustering using scikit-learn
    Selecting the optimal K using the elbow method
  10.3 Using density to discover clusters
  10.4 DBSCAN: A clustering algorithm for grouping data based on spatial density
    Comparing DBSCAN and K-means
    Clustering based on non-Euclidean distance
  10.5 Analyzing clusters using Pandas
11 Geographic location visualization and analysis
  11.1 The great-circle distance: A metric for computing the distance between two global points
  11.2 Plotting maps using Cartopy
    Manually installing GEOS and Cartopy
    Utilizing the Conda package manager
    Visualizing maps
  11.3 Location tracking using GeoNamesCache
    Accessing country information
    Accessing city information
    Limitations of the GeoNamesCache library
  11.4 Matching location names in text
12 Case study 3 solution
  12.1 Extracting locations from headline data
  12.2 Visualizing and clustering the extracted location data
  12.3 Extracting insights from location clusters
Case study 4: Using online job postings to improve your data science resume
  12.4 Problem statement
    Dataset description
  12.5 Overview
13 Measuring text similarities
  13.1 Simple text comparison
    Exploring the Jaccard similarity
    Replacing words with numeric values
  13.2 Vectorizing texts using word counts
    Using normalization to improve TF vector similarity
    Using unit vector dot products to convert between relevance metrics
  13.3 Matrix multiplication for efficient similarity calculation
    Basic matrix operations
    Computing all-by-all matrix similarities
  13.4 Computational limits of matrix multiplication
14 Dimension reduction of matrix data
  14.1 Clustering 2D data in one dimension
    Reducing dimensions using rotation
  14.2 Dimension reduction using PCA and scikit-learn
  14.3 Clustering 4D data in two dimensions
    Limitations of PCA
  14.4 Computing principal components without rotation
    Extracting eigenvectors using power iteration
  14.5 Efficient dimension reduction using SVD and scikit-learn
15 NLP analysis of large text datasets
  15.1 Loading online forum discussions using scikit-learn
  15.2 Vectorizing documents using scikit-learn
  15.3 Ranking words by both post frequency and count
    Computing TFIDF vectors with scikit-learn
  15.4 Computing similarities across large document datasets
  15.5 Clustering texts by topic
    Exploring a single text cluster
  15.6 Visualizing text clusters
    Using subplots to display multiple word clouds
16 Extracting text from web pages
  16.1 The structure of HTML documents
  16.2 Parsing HTML using Beautiful Soup
  16.3 Downloading and parsing online data
17 Case study 4 solution
  17.1 Extracting skill requirements from job posting data
    Exploring the HTML for skill descriptions
  17.2 Filtering jobs by relevance
  17.3 Clustering skills in relevant job postings
    Grouping the job skills into 15 clusters
    Investigating the technical skill clusters
    Investigating the soft-skill clusters
    Exploring clusters at alternative values of K
    Analyzing the 700 most relevant postings
  17.4 Conclusion
Case study 5: Predicting future friendships from social network data
  17.5 Problem statement
    Introducing the friend-of-a-friend recommendation algorithm
    Predicting user behavior
  17.6 Dataset description
    The Profiles table
    The Observations table
    The Friendships table
  17.7 Overview
18 An introduction to graph theory and network analysis
  18.1 Using basic graph theory to rank websites by popularity
    Analyzing web networks using NetworkX
  18.2 Utilizing undirected graphs to optimize the travel time between towns
    Modeling a complex network of towns and counties
    Computing the fastest travel time between nodes
19 Dynamic graph theory techniques for node ranking and social network analysis
  19.1 Uncovering central nodes based on expected traffic in a network
    Measuring centrality using traffic simulations
  19.2 Computing travel probabilities using matrix multiplication
    Deriving PageRank centrality from probability theory
    Computing PageRank centrality using NetworkX
  19.3 Community detection using Markov clustering
  19.4 Uncovering friend groups in social networks
20 Network-driven supervised machine learning
  20.1 The basics of supervised machine learning
  20.2 Measuring predicted label accuracy
    Scikit-learn's prediction measurement functions
  20.3 Optimizing KNN performance
  20.4 Running a grid search using scikit-learn
  20.5 Limitations of the KNN algorithm
21 Training linear classifiers with logistic regression
  21.1 Linearly separating customers by size
  21.2 Training a linear classifier
    Improving perceptron performance through standardization
  21.3 Improving linear classification with logistic regression
    Running logistic regression on more than two features
  21.4 Training linear classifiers using scikit-learn
    Training multiclass linear models
  21.5 Measuring feature importance with coefficients
  21.6 Linear classifier limitations
22 Training nonlinear classifiers with decision tree techniques
  22.1 Automated learning of logical rules
    Training a nested if/else model using two features
    Deciding which feature to split on
    Training if/else models with more than two features
  22.2 Training decision tree classifiers using scikit-learn
    Studying cancerous cells using feature importance
  22.3 Decision tree classifier limitations
  22.4 Improving performance using random forest classification
  22.5 Training random forest classifiers using scikit-learn
23 Case study 5 solution
  23.1 Exploring the data
    Examining the profiles
    Exploring the experimental observations
    Exploring the Friendships linkage table
  23.2 Training a predictive model using network features
  23.3 Adding profile features to the model
  23.4 Optimizing performance across a steady set of features
  23.5 Interpreting the trained model
    Why are generalizable models so important?
index
|
any_adam_object | 1 |
any_adam_object_boolean | 1 |
author | Apeltsin, Leonard ca. 20./21. Jh |
author_GND | (DE-588)1280840129 |
author_facet | Apeltsin, Leonard ca. 20./21. Jh |
author_role | aut |
author_sort | Apeltsin, Leonard ca. 20./21. Jh |
author_variant | l a la |
building | Verbundindex |
bvnumber | BV048630066 |
classification_rvk | ST 300 |
ctrlnum | (OCoLC)1294301294 (DE-599)KXP178714853X |
dewey-full | 006.312 |
dewey-hundreds | 000 - Computer science, information, general works |
dewey-ones | 006 - Special computer methods |
dewey-raw | 006.312 |
dewey-search | 006.312 |
dewey-sort | 16.312 |
dewey-tens | 000 - Computer science, information, general works |
discipline | Informatik |
discipline_str_mv | Informatik |
format | Book |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>02944nam a2200469 c 4500</leader><controlfield tag="001">BV048630066</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20230213 </controlfield><controlfield tag="007">t</controlfield><controlfield tag="008">230104s2021 xxua||| |||| 00||| eng d</controlfield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9781617296253</subfield><subfield code="c">softcover</subfield><subfield code="9">978-1-61729-625-3</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)1294301294</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)KXP178714853X</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield><subfield code="e">rda</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="044" ind1=" " ind2=" "><subfield code="a">xxu</subfield><subfield code="c">XD-US</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-739</subfield></datafield><datafield tag="082" ind1="0" ind2=" "><subfield code="a">006.312</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 300</subfield><subfield code="0">(DE-625)143650:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Apeltsin, Leonard</subfield><subfield code="d">ca. 20./21. 
Jh.</subfield><subfield code="e">Verfasser</subfield><subfield code="0">(DE-588)1280840129</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Data science bookcamp</subfield><subfield code="b">five Python projects</subfield><subfield code="c">Leonard Apeltsin</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Shelter Island</subfield><subfield code="b">Manning</subfield><subfield code="c">[2021]</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">xxvi, 676 Seiten</subfield><subfield code="b">Diagramme, Illustrationen</subfield><subfield code="c">24 cm</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="500" ind1=" " ind2=" "><subfield code="a">Untertitel auf Cover: Five real-world Python projects</subfield></datafield><datafield tag="500" ind1=" " ind2=" "><subfield code="a">Literaturangaben</subfield></datafield><datafield tag="520" ind1="3" ind2=" "><subfield code="a">1. Computing probabilities using Python -- 2. Plotting probabilities using Matplotlib -- 3. Running random simulations in NumPy -- 4. Case study 1 solution -- 5. Basic probability and statistical analysis using SciPy -- 6. Making predictions using the central limit theorem and SciPy -- 7. Statistical hypothesis testing -- 8. Analyzing tables using Pandas -- 9. Case study 2 solution -- 10. Clustering data into groups -- 11. Geographic location visualization and analysis -- 12. Case study 3 solution -- 13. Measuring text similarities -- 14. Dimension reduction of matrix data -- 15. NLP analysis of large text datasets -- 16. 
Extracting text from web pages -- 17. Case study 4 solution -- 18. An introduction to graph theory and network analysis -- 19. Dynamic graph theory techniques for node ranking and social network analysis -- 20. Network-driven supervised machine learning -- 21. Training linear classifiers with logistic regression -- 22. Training nonlinear classifiers with decision tree techniques -- 23. Case study 5 solution.</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Data Science</subfield><subfield code="0">(DE-588)1140936166</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Data Mining</subfield><subfield code="0">(DE-588)4428654-5</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Python</subfield><subfield code="g">Programmiersprache</subfield><subfield code="0">(DE-588)4434275-5</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="653" ind1=" " ind2="0"><subfield code="a">Data mining</subfield></datafield><datafield tag="653" ind1=" " ind2="0"><subfield code="a">Data sets</subfield></datafield><datafield tag="653" ind1=" " ind2="0"><subfield code="a">Python (Computer program language)</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">Data Science</subfield><subfield code="0">(DE-588)1140936166</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="1"><subfield code="a">Data Mining</subfield><subfield code="0">(DE-588)4428654-5</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="2"><subfield code="a">Python</subfield><subfield code="g">Programmiersprache</subfield><subfield code="0">(DE-588)4434275-5</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" 
"><subfield code="5">DE-604</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="u">https://www.gbv.de/dms/bowker/toc/9781617296253.pdf</subfield><subfield code="v">2022-07-28</subfield><subfield code="x">Aggregator</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">Digitalisierung UB Passau - ADAM Catalogue Enrichment</subfield><subfield code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=034005124&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="999" ind1=" " ind2=" "><subfield code="a">oai:aleph.bib-bvb.de:BVB01-034005124</subfield></datafield></record></collection> |
id | DE-604.BV048630066 |
illustrated | Illustrated |
index_date | 2024-07-03T21:15:51Z |
indexdate | 2024-07-10T09:44:29Z |
institution | BVB |
isbn | 9781617296253 |
language | English |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-034005124 |
oclc_num | 1294301294 |
open_access_boolean | |
owner | DE-739 |
owner_facet | DE-739 |
physical | xxvi, 676 Seiten Diagramme, Illustrationen 24 cm |
publishDate | 2021 |
publishDateSearch | 2021 |
publishDateSort | 2021 |
publisher | Manning |
record_format | marc |
spelling | Apeltsin, Leonard ca. 20./21. Jh. Verfasser (DE-588)1280840129 aut Data science bookcamp five Python projects Leonard Apeltsin Shelter Island Manning [2021] xxvi, 676 Seiten Diagramme, Illustrationen 24 cm txt rdacontent n rdamedia nc rdacarrier Untertitel auf Cover: Five real-world Python projects Literaturangaben 1. Computing probabilities using Python -- 2. Plotting probabilities using Matplotlib -- 3. Running random simulations in NumPy -- 4. Case study 1 solution -- 5. Basic probability and statistical analysis using SciPy -- 6. Making predictions using the central limit theorem and SciPy -- 7. Statistical hypothesis testing -- 8. Analyzing tables using Pandas -- 9. Case study 2 solution -- 10. Clustering data into groups -- 11. Geographic location visualization and analysis -- 12. Case study 3 solution -- 13. Measuring text similarities -- 14. Dimension reduction of matrix data -- 15. NLP analysis of large text datasets -- 16. Extracting text from web pages -- 17. Case study 4 solution -- 18. An introduction to graph theory and network analysis -- 19. Dynamic graph theory techniques for node ranking and social network analysis -- 20. Network-driven supervised machine learning -- 21. Training linear classifiers with logistic regression -- 22. Training nonlinear classifiers with decision tree techniques -- 23. Case study 5 solution. 
Data Science (DE-588)1140936166 gnd rswk-swf Data Mining (DE-588)4428654-5 gnd rswk-swf Python Programmiersprache (DE-588)4434275-5 gnd rswk-swf Data mining Data sets Python (Computer program language) Data Science (DE-588)1140936166 s Data Mining (DE-588)4428654-5 s Python Programmiersprache (DE-588)4434275-5 s DE-604 https://www.gbv.de/dms/bowker/toc/9781617296253.pdf 2022-07-28 Aggregator Inhaltsverzeichnis Digitalisierung UB Passau - ADAM Catalogue Enrichment application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=034005124&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA Inhaltsverzeichnis |
spellingShingle | Apeltsin, Leonard ca. 20./21. Jh Data science bookcamp five Python projects Data Science (DE-588)1140936166 gnd Data Mining (DE-588)4428654-5 gnd Python Programmiersprache (DE-588)4434275-5 gnd |
subject_GND | (DE-588)1140936166 (DE-588)4428654-5 (DE-588)4434275-5 |
title | Data science bookcamp five Python projects |
title_auth | Data science bookcamp five Python projects |
title_exact_search | Data science bookcamp five Python projects |
title_exact_search_txtP | Data science bookcamp five Python projects |
title_full | Data science bookcamp five Python projects Leonard Apeltsin |
title_fullStr | Data science bookcamp five Python projects Leonard Apeltsin |
title_full_unstemmed | Data science bookcamp five Python projects Leonard Apeltsin |
title_short | Data science bookcamp |
title_sort | data science bookcamp five python projects |
title_sub | five Python projects |
topic | Data Science (DE-588)1140936166 gnd Data Mining (DE-588)4428654-5 gnd Python Programmiersprache (DE-588)4434275-5 gnd |
topic_facet | Data Science Data Mining Python Programmiersprache |
url | https://www.gbv.de/dms/bowker/toc/9781617296253.pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=034005124&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
work_keys_str_mv | AT apeltsinleonard datasciencebookcampfivepythonprojects |
No print copy is available.
Table of contents