Textual data science with R
Main author: | Bécue-Bertaut, Mónica |
---|---|
Format: | Book |
Language: | English |
Published: | Boca Raton ; London ; New York: CRC Press Taylor & Francis Group, 2018 |
Series: | Chapman & Hall/CRC computer science and data analysis series |
Subjects: | R (program), Text mining |
Online access: | Table of contents |
Description: | xvii, 194 pages, illustrations, diagrams |
ISBN: | 9781138626911 1138626910 |
Internal format
MARC
LEADER | 00000nam a2200000 c 4500 | ||
---|---|---|---|
001 | BV045269349 | ||
003 | DE-604 | ||
005 | 20230331 | ||
007 | t | ||
008 | 181105s2018 a||| |||| 00||| eng d | ||
020 | |a 9781138626911 |9 978-1-138-62691-1 | ||
020 | |a 1138626910 |9 1-138-62691-0 | ||
035 | |a (OCoLC)1090769100 | ||
035 | |a (DE-599)BVBBV045269349 | ||
040 | |a DE-604 |b ger |e rda | ||
041 | 0 | |a eng | |
049 | |a DE-739 |a DE-473 |a DE-706 | ||
084 | |a ST 250 |0 (DE-625)143626: |2 rvk | ||
084 | |a ES 915 |0 (DE-625)27928: |2 rvk | ||
100 | 1 | |a Bécue-Bertaut, Mónica |e Verfasser |0 (DE-588)1238916554 |4 aut | |
245 | 1 | 0 | |a Textual data science with R |c Monica Becue-Bertaut |
264 | 1 | |a Boca Raton ; London ; New York |b CRC Press Taylor & Francis Group |c 2018 | |
300 | |a xvii, 194 Seiten |b Illustrationen, Diagramme | ||
336 | |b txt |2 rdacontent | ||
337 | |b n |2 rdamedia | ||
338 | |b nc |2 rdacarrier | ||
490 | 0 | |a Chapman & Hall/CRC computer science and data analysis series | |
650 | 0 | 7 | |a R |g Programm |0 (DE-588)4705956-4 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Text Mining |0 (DE-588)4728093-1 |2 gnd |9 rswk-swf |
689 | 0 | 0 | |a Text Mining |0 (DE-588)4728093-1 |D s |
689 | 0 | 1 | |a R |g Programm |0 (DE-588)4705956-4 |D s |
689 | 0 | |5 DE-604 | |
856 | 4 | 2 | |m Digitalisierung UB Passau - ADAM Catalogue Enrichment |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=030657144&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis |
999 | |a oai:aleph.bib-bvb.de:BVB01-030657144 |
Record in the search index
_version_ | 1804179031629758464 |
---|---|
adam_text | Contents
Foreword
Preface
1 Encoding: from a corpus to statistical tables
1.1 Textual and contextual data
1.1.1 Textual data
1.1.2 Contextual data
1.1.3 Documents and aggregate documents
1.2 Examples and notation
1.3 Choosing textual units
1.3.1 Graphical forms
1.3.2 Lemmas
1.3.3 Stems
1.3.4 Repeated segments
1.3.5 In practice
1.4 Preprocessing
1.4.1 Unique spelling
1.4.2 Partially automated preprocessing
1.4.3 Word selection
1.5 Word and segment indexes
1.6 The LifeMK corpus: preliminary results
1.6.1 Verbal content through word and repeated segment indexes
1.6.2 Univariate description of contextual variables
1.6.3 A note on the frequency range
1.7 Implementation with Xplortext
1.8 Summary
2 Correspondence analysis of textual data
2.1 Data and goals
2.1.1 Correspondence analysis: a tool for linguistic data analysis
2.1.2 Data: a small example
2.1.3 Objectives
2.2 Associations between documents and words
2.2.1 Profile comparisons
2.2.2 Independence of documents and words
2.2.3 The χ² test
2.2.4 Association rates between documents and words
2.3 Active row and column clouds
2.3.1 Row and column profile spaces
2.3.2 Distributional equivalence and the χ² distance
2.3.3 Inertia of a cloud
2.4 Fitting document and word clouds
2.4.1 Factorial axes
2.4.2 Visualizing rows and columns
2.4.2.1 Category representation
2.4.2.2 Word representation
2.4.2.3 Transition formulas
2.4.2.4 Simultaneous representation of rows and columns
2.5 Interpretation aids
2.5.1 Eigenvalues and representation quality of the clouds
2.5.2 Contribution of documents and words to axis inertia
2.5.3 Representation quality of a point
2.6 Supplementary rows and columns
2.6.1 Supplementary tables
2.6.2 Supplementary frequency rows and columns
2.6.3 Supplementary quantitative and qualitative variables
2.7 Validating the visualization
2.8 Interpretation scheme for textual CA results
2.9 Implementation with Xplortext
2.10 Summary of the CA approach
3 Applications of correspondence analysis
3.1 Choosing the level of detail for analyses
3.2 Correspondence analysis on aggregate free text answers
3.2.1 Data and objectives
3.2.2 Word selection
3.2.3 CA on the aggregate table
3.2.3.1 Document representation
3.2.3.2 Word representation
3.2.3.3 Simultaneous interpretation of the plots
3.2.4 Supplementary elements
3.2.4.1 Supplementary words
3.2.4.2 Supplementary repeated segments
3.2.4.3 Supplementary categories
3.2.5 Implementation with Xplortext
3.3 Direct analysis
3.3.1 Data and objectives
3.3.2 The main features of direct analysis
3.3.3 Direct analysis of the culture question
3.3.4 Implementation with Xplortext
4 Clustering in textual data science
4.1 Clustering documents
4.2 Dissimilarity measures between documents
4.3 Measuring partition quality
4.3.1 Document clusters in the factorial space
4.3.2 Partition quality
4.4 Dissimilarity measures between document clusters
4.4.1 The single-linkage method
4.4.2 The complete-linkage method
4.4.3 Ward’s method
4.5 Agglomerative hierarchical clustering
4.5.1 Hierarchical tree construction algorithm
4.5.2 Selecting the final partition
4.5.3 Interpreting clusters
4.6 Direct partitioning
4.7 Combining clustering methods
4.7.1 Consolidating partitions
4.7.2 Direct partitioning followed by AHC
4.8 A procedure for combining CA and clustering
4.9 Example: joint use of CA and AHC
4.9.1 Data and objectives
4.9.1.1 Data preprocessing using CA
4.9.1.2 Constructing the hierarchical tree
4.9.1.3 Choosing the final partition
4.10 Contiguity-constrained hierarchical clustering
4.10.1 Principles and algorithm
4.10.2 AHC of age groups with a chronological constraint
4.10.3 Implementation with Xplortext
4.11 Example: clustering free text answers
4.11.1 Data and objectives
4.11.2 Data preprocessing
4.11.2.1 CA: eigenvalues and total inertia
4.11.2.2 Interpreting the first axes
4.11.3 AHC: building the tree and choosing the final partition
4.12 Describing cluster features
4.12.1 Lexical features of clusters
4.12.1.1 Describing clusters in terms of characteristic words
4.12.1.2 Describing clusters in terms of characteristic documents
4.12.2 Describing clusters using contextual variables
4.12.2.1 Describing clusters using contextual qualitative variables
4.12.2.2 Describing clusters using quantitative contextual variables
4.12.3 Implementation with Xplortext
4.13 Summary of the use of AHC on factorial coordinates coming from CA
5 Lexical characterization of parts of a corpus
5.1 Characteristic words
5.2 Characteristic words and CA
5.3 Characteristic words and clustering
5.3.1 Clustering based on verbal content
5.3.2 Clustering based on contextual variables
5.3.3 Hierarchical words
5.4 Characteristic documents
5.5 Example: characteristic elements and CA
5.5.1 Characteristic words for the categories
5.5.2 Characteristic words and factorial planes
5.5.3 Documents that characterize categories
5.6 Characteristic words in addition to clustering
5.7 Implementation with Xplortext
6 Multiple factor analysis for textual data
6.1 Multiple tables in textual data analysis
6.2 Data and objectives
6.2.1 Data preprocessing
6.2.2 Problems posed by lemmatization
6.2.3 Description of the corpora data
6.2.4 Indexes of the most frequent words
6.2.5 Notation
6.2.6 Objectives
6.3 Introduction to MFACT
6.3.1 The limits of CA on multiple contingency tables
6.3.2 How MFACT works
6.3.3 Integrating contextual variables
6.4 Analysis of multilingual free text answers
6.4.1 MFACT: eigenvalues of the global analysis
6.4.2 Representation of documents and words
6.4.3 Superimposed representation of the global and partial configurations
6.4.4 Links between the axes of the global analysis and the separate analyses
6.4.5 Representation of the groups of words
6.4.6 Implementation with Xplortext
6.5 Simultaneous analysis of two open-ended questions: impact of lemmatization
6.5.1 Objectives
6.5.2 Preliminary steps
6.5.3 MFACT on the left and right: lemmatized or non-lemmatized
6.5.4 Implementation with Xplortext
6.6 Other applications of MFACT in textual data science
6.7 MFACT summary
7 Applications and analysis workflows
7.1 General rules for presenting results
7.2 Analyzing bibliographic databases
7.2.1 Introduction
7.2.2 The lupus data
7.2.2.1 The corpus
7.2.2.2 Exploratory analysis of the corpus
7.2.3 CA of the documents × words table
7.2.3.1 The eigenvalues
7.2.3.2 Meta-keys and doc-keys
7.2.4 Analysis of the year-aggregate table
7.2.4.1 Eigenvalues and CA of the lexical table
7.2.5 Chronological study of drug names
7.2.6 Implementation with Xplortext
7.2.7 Conclusions from the study
7.3 Badinter’s speech: a discursive strategy
7.3.1 Introduction
7.3.2 Methods
7.3.2.1 Breaking up the corpus into documents
7.3.2.2 The speech trajectory unveiled by CA
7.3.3 Results
7.3.4 Argument flow
7.3.5 Conclusions on the study of Badinter’s speech
7.3.6 Implementation with Xplortext
7.4 Political speeches
7.4.1 Introduction
7.4.2 Data and objectives
7.4.3 Methodology
7.4.4 Results
7.4.4.1 Data preprocessing
7.4.4.2 Lexicometric characteristics of the 11 speeches and lexical table coding
7.4.4.3 Eigenvalues and Cramer’s V
7.4.4.4 Speech trajectory
7.4.4.5 Word representation
7.4.4.6 Remarks
7.4.4.7 Hierarchical structure of the corpus
7.4.4.8 Conclusions
7.4.5 Implementation with Xplortext
7.5 Corpus of sensory descriptions
7.5.1 Introduction
7.5.2 Data
7.5.2.1 Eight Catalan wines
7.5.2.2 Jury
7.5.2.3 Verbal categorization
7.5.2.4 Encoding the data
7.5.3 Objectives
7.5.4 Statistical methodology
7.5.4.1 MFACT and constructing the mean configuration
7.5.4.2 Determining consensual words
7.5.5 Results
7.5.5.1 Data preprocessing
7.5.5.2 Some initial results
7.5.5.3 Individual configurations
7.5.5.4 MFACT: directions of inertia common to the majority of groups
7.5.5.5 MFACT: representing words and documents on the first plane
7.5.5.6 Word contributions
7.5.5.7 MFACT: group representation
7.5.5.8 Consensual words
7.5.6 Conclusion
7.5.7 Implementation with Xplortext
Appendix: Textual data science packages in R
Bibliography
Index
|
any_adam_object | 1 |
author | Bécue-Bertaut, Mónica |
author_GND | (DE-588)1238916554 |
author_facet | Bécue-Bertaut, Mónica |
author_role | aut |
author_sort | Bécue-Bertaut, Mónica |
author_variant | m b b mbb |
building | Verbundindex |
bvnumber | BV045269349 |
classification_rvk | ST 250 ES 915 |
ctrlnum | (OCoLC)1090769100 (DE-599)BVBBV045269349 |
discipline | Informatik Sprachwissenschaft Literaturwissenschaft |
format | Book |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>01547nam a2200373 c 4500</leader><controlfield tag="001">BV045269349</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20230331 </controlfield><controlfield tag="007">t</controlfield><controlfield tag="008">181105s2018 a||| |||| 00||| eng d</controlfield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9781138626911</subfield><subfield code="9">978-1-138-62691-1</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">1138626910</subfield><subfield code="9">1-138-62691-0</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)1090769100</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)BVBBV045269349</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield><subfield code="e">rda</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-739</subfield><subfield code="a">DE-473</subfield><subfield code="a">DE-706</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 250</subfield><subfield code="0">(DE-625)143626:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ES 915</subfield><subfield code="0">(DE-625)27928:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Bécue-Bertaut, Mónica</subfield><subfield code="e">Verfasser</subfield><subfield code="0">(DE-588)1238916554</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Textual data science with R</subfield><subfield code="c">Monica Becue-Bertaut</subfield></datafield><datafield 
tag="264" ind1=" " ind2="1"><subfield code="a">Boca Raton ; London ; New York</subfield><subfield code="b">CRC Press Taylor & Francis Group</subfield><subfield code="c">2018</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">xvii, 194 Seiten</subfield><subfield code="b">Illustrationen, Diagramme</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="490" ind1="0" ind2=" "><subfield code="a">Chapman & Hall/CRC computer science and data analysis series</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">R</subfield><subfield code="g">Programm</subfield><subfield code="0">(DE-588)4705956-4</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Text Mining</subfield><subfield code="0">(DE-588)4728093-1</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">Text Mining</subfield><subfield code="0">(DE-588)4728093-1</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="1"><subfield code="a">R</subfield><subfield code="g">Programm</subfield><subfield code="0">(DE-588)4705956-4</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" "><subfield code="5">DE-604</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">Digitalisierung UB Passau - ADAM Catalogue Enrichment</subfield><subfield code="q">application/pdf</subfield><subfield 
code="u">http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=030657144&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="999" ind1=" " ind2=" "><subfield code="a">oai:aleph.bib-bvb.de:BVB01-030657144</subfield></datafield></record></collection> |
id | DE-604.BV045269349 |
illustrated | Illustrated |
indexdate | 2024-07-10T08:13:24Z |
institution | BVB |
isbn | 9781138626911 1138626910 |
language | English |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-030657144 |
oclc_num | 1090769100 |
open_access_boolean | |
owner | DE-739 DE-473 DE-BY-UBG DE-706 |
owner_facet | DE-739 DE-473 DE-BY-UBG DE-706 |
physical | xvii, 194 Seiten Illustrationen, Diagramme |
publishDate | 2018 |
publishDateSearch | 2018 |
publishDateSort | 2018 |
publisher | CRC Press Taylor & Francis Group |
record_format | marc |
series2 | Chapman & Hall/CRC computer science and data analysis series |
spelling | Bécue-Bertaut, Mónica Verfasser (DE-588)1238916554 aut Textual data science with R Monica Becue-Bertaut Boca Raton ; London ; New York CRC Press Taylor & Francis Group 2018 xvii, 194 Seiten Illustrationen, Diagramme txt rdacontent n rdamedia nc rdacarrier Chapman & Hall/CRC computer science and data analysis series R Programm (DE-588)4705956-4 gnd rswk-swf Text Mining (DE-588)4728093-1 gnd rswk-swf Text Mining (DE-588)4728093-1 s R Programm (DE-588)4705956-4 s DE-604 Digitalisierung UB Passau - ADAM Catalogue Enrichment application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=030657144&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA Inhaltsverzeichnis |
spellingShingle | Bécue-Bertaut, Mónica Textual data science with R R Programm (DE-588)4705956-4 gnd Text Mining (DE-588)4728093-1 gnd |
subject_GND | (DE-588)4705956-4 (DE-588)4728093-1 |
title | Textual data science with R |
title_auth | Textual data science with R |
title_exact_search | Textual data science with R |
title_full | Textual data science with R Monica Becue-Bertaut |
title_fullStr | Textual data science with R Monica Becue-Bertaut |
title_full_unstemmed | Textual data science with R Monica Becue-Bertaut |
title_short | Textual data science with R |
title_sort | textual data science with r |
topic | R Programm (DE-588)4705956-4 gnd Text Mining (DE-588)4728093-1 gnd |
topic_facet | R Programm Text Mining |
url | http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=030657144&sequence=000001&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
work_keys_str_mv | AT becuebertautmonica textualdatasciencewithr |