Taming text: how to find, organize, and manipulate it
Main authors: | Ingersoll, Grant S.; Morton, Thomas S.; Farris, Andrew L. |
---|---|
Format: | Book |
Language: | English |
Published: | Shelter Island, NY: Manning, 2013 |
Subjects: | Text Mining |
Online access: | Table of contents |
Description: | XXI, 298 pp., ill., diagrams, 24 cm |
ISBN: | 9781933988382 193398838X |
Internal format
MARC
LEADER | 00000nam a2200000 c 4500 | ||
---|---|---|---|
001 | BV037187785 | ||
003 | DE-604 | ||
005 | 20211027 | ||
007 | t | ||
008 | 110127s2013 ad|| |||| 00||| eng d | ||
020 | |a 9781933988382 |c pbk. |9 978-1-933988-38-2 | ||
020 | |a 193398838X |9 1-933988-38-X | ||
035 | |a (OCoLC)706990107 | ||
035 | |a (DE-599)BVBBV037187785 | ||
040 | |a DE-604 |b ger |e rakwb | ||
041 | 0 | |a eng | |
049 | |a DE-19 |a DE-523 |a DE-188 |a DE-739 |a DE-B768 |a DE-210 | ||
084 | |a ST 306 |0 (DE-625)143654: |2 rvk | ||
084 | |a ST 350 |0 (DE-625)143667: |2 rvk | ||
100 | 1 | |a Ingersoll, Grant S. |e Verfasser |4 aut | |
245 | 1 | 0 | |a Taming text |b how to find, organize, and manipulate it |c Grant S. Ingersoll ; Thomas S. Morton ; Andrew L. Farris |
264 | 1 | |a Shelter Island, NY |b Manning |c 2013 | |
300 | |a XXI, 298 S. |b Ill., graph. Darst. |c 24 cm | ||
336 | |b txt |2 rdacontent | ||
337 | |b n |2 rdamedia | ||
338 | |b nc |2 rdacarrier | ||
650 | 0 | 7 | |a Text Mining |0 (DE-588)4728093-1 |2 gnd |9 rswk-swf |
689 | 0 | 0 | |a Text Mining |0 (DE-588)4728093-1 |D s |
689 | 0 | |5 DE-604 | |
700 | 1 | |a Morton, Thomas S. |e Verfasser |4 aut | |
700 | 1 | |a Farris, Andrew L. |e Verfasser |4 aut | |
856 | 4 | 2 | |m HBZ Datenaustausch |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=021102273&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis |
999 | |a oai:aleph.bib-bvb.de:BVB01-021102273 |
Record in the search index
_version_ | 1804143772419751936 |
---|---|
adam_text | Title: Taming text
Author: Ingersoll, Grant S.
Year: 2013
contents
foreword xiii
preface xiv
acknowledgments xvii
about this book xix
about the cover illustration xxii
Getting started taming text 1
1.1 Why taming text is important 2
1.2 Preview: A fact-based, question answering system 4
Hello, Dr. Frankenstein 5
1.3 Understanding text is hard 8
1.4 Text, tamed 10
1.5 Text and the intelligent app: search and beyond 11
Searching and matching 12 * Extracting information 13
Grouping information 13 * An intelligent application 14
1.6 Summary 14
1.7 Resources 14
Foundations of taming text 16
2.1 Foundations of language 17
Words and their categories 18 * Phrases and clauses 19
Morphology 20
2.2 Common tools for text processing 21
String manipulation tools 21 * Tokens and tokenization 22
Part of speech assignment 24 * Stemming 25 * Sentence
detection 27 * Parsing and grammar 28 * Sequence
modeling 30
2.3 Preprocessing and extracting content from common file
formats 31
The importance of preprocessing 31 * Extracting content using
Apache Tika 33
2.4 Summary 36
2.5 Resources 36
Searching 37
3.1 Search and faceting example: Amazon.com 38
3.2 Introduction to search concepts 40
Indexing content 41 * User input 43 * Ranking documents
with the vector space model 46 * Results display 49
3.3 Introducing the Apache Solr search server 52
Running Solr for the first time 52 * Understanding Solr
concepts 54
3.4 Indexing content with Apache Solr 57
Indexing using XML 58 * Extracting and indexing content
using Solr and Apache Tika 59
3.5 Searching content with Apache Solr 63
Solr query input parameters 64 * Faceting on extracted
content 67
3.6 Understanding search performance factors 69
Judging quality 69 * Judging quantity 73
3.7 Improving search performance 74
Hardware improvements 74 * Analysis improvements 75
Query performance improvements 76 * Alternative scoring
models 79 * Techniques for improving Solr performance 80
3.8 Search alternatives 82
3.9 Summary 83
3.10 Resources 83
Fuzzy string matching 84
4.1 Approaches to fuzzy string matching 86
Character overlap measures 86 * Edit distance measures 89
N-gram edit distance 92
4.2 Finding fuzzy string matches 94
Using prefixes for matching with Solr 94 * Using a trie for
prefix matching 95 * Using n-grams for matching 99
4.3 Building fuzzy string matching applications 100
Adding type-ahead to search 101 * Query spell-checking for
search 105 * Record matching 109
4.4 Summary 114
4.5 Resources 114
Identifying people, places, and things 115
5.1 Approaches to named-entity recognition 117
Using rules to identify names 117 * Using statistical
classifiers to identify names 118
5.2 Basic entity identification with OpenNLP 119
Finding names with OpenNLP 120 * Interpreting names
identified by OpenNLP 121 * Filtering names based on
probability 122
5.3 In-depth entity identification with OpenNLP 123
Identifying multiple entity types with OpenNLP 123
Under the hood: how OpenNLP identifies names 126
5.4 Performance of OpenNLP 128
Quality of results 129 * Runtime performance 130
Memory usage in OpenNLP 131
5.5 Customizing OpenNLP entity identification
for a new domain 132
The whys and hows of training a model 132 * Training
an OpenNLP model 133 * Altering modeling inputs 134
A new way to model names 136
5.6 Summary 138
5.7 Further reading 139
Clustering text 140
6.1 Google News document clustering 141
6.2 Clustering foundations 142
Three types of text to cluster 142 * Choosing a clustering
algorithm 144 * Determining similarity 145 * Labeling the
results 146 * How to evaluate clustering results 147
6.3 Setting up a simple clustering application 149
6.4 Clustering search results using Carrot2 149
Using the Carrot2 API 150 * Clustering Solr search results
using Carrot2 151
6.5 Clustering document collections with Apache
Mahout 154
Preparing the data for clustering 155 * K-Means
clustering 158
6.6 Topic modeling using Apache Mahout 162
6.7 Examining clustering performance 164
Feature selection and reduction 164 * Carrot2 performance
and quality 167 * Mahout clustering benchmarks 168
6.8 Acknowledgments 172
6.9 Summary 173
6.10 References 173
Classification, categorization, and tagging 175
7.1 Introduction to classification and categorization 177
7.2 The classification process 180
Choosing a classification scheme 181 * Identifying features
for text categorization 182 * The importance of training
data 183 * Evaluating classifier performance 186
Deploying a classifier into production 188
7.3 Building document categorizers using Apache
Lucene 189
Categorizing text with Lucene 189 * Preparing the training
data for the MoreLikeThis categorizer 191 * Training the
MoreLikeThis categorizer 193 * Categorizing documents
with the MoreLikeThis categorizer 197 * Testing the
MoreLikeThis categorizer 199 * MoreLikeThis in
production 201
7.4 Training a naive Bayes classifier using Apache
Mahout 202
Categorizing text using naive Bayes classification 202
Preparing the training data 204 * Withholding test data 207
Training the classifier 208 * Testing the classifier 209
Improving the bootstrapping process 210 * Integrating the
Mahout Bayes classifier with Solr 212
7.5 Categorizing documents with OpenNLP 215
Regression models and maximum entropy document
categorization 216 * Preparing training data for the maximum
entropy document categorizer 219 * Training the maximum
entropy document categorizer 220 * Testing the maximum entropy
document classifier 224 * Maximum entropy document
categorization in production 225
7.6 Building a tag recommender using Apache Solr 227
Collecting training data for tag recommendations 229
Preparing the training data 231 * Training the Solr tag
recommender 232 * Creating tag recommendations 234
Evaluating the tag recommender 236
7.7 Summary 238
7.8 References 239
Building an example question answering system 240
8.1 Basics of a question answering system 242
8.2 Installing and running the QA code 243
8.3 A sample question answering architecture 245
8.4 Understanding questions and producing answers 248
Training the answer type classifier 248 * Chunking the
query 251 * Computing the answer type 252 * Generating the
query 255 * Ranking candidate passages 256
8.5 Steps to improve the system 258
8.6 Summary 259
8.7 Resources 259
Untamed text: exploring the next frontier 260
9.1 Semantics, discourse, and pragmatics:
exploring higher levels of NLP 261
Semantics 262 * Discourse 263 * Pragmatics 264
9.2 Document and collection summarization 266
9.3 Relationship extraction 268
Overview of approaches 270 * Evaluation 272 * Tools for
relationship extraction 273
9.4 Identifying important content and people 273
Global importance and authoritativeness 274 * Personal
importance 275 * Resources and pointers on importance 275
9.5 Detecting emotions via sentiment analysis 276
History and review 276 * Tools and data needs 278 * A basic
polarity algorithm 279 * Advanced topics 280 * Open source
libraries for sentiment analysis 281
9.6 Cross-language information retrieval 282
9.7 Summary 284
9.8 References 284
index 287
|
any_adam_object | 1 |
author | Ingersoll, Grant S. Morton, Thomas S. Farris, Andrew L. |
author_facet | Ingersoll, Grant S. Morton, Thomas S. Farris, Andrew L. |
author_role | aut aut aut |
author_sort | Ingersoll, Grant S. |
author_variant | g s i gs gsi t s m ts tsm a l f al alf |
building | Verbundindex |
bvnumber | BV037187785 |
classification_rvk | ST 306 ST 350 |
ctrlnum | (OCoLC)706990107 (DE-599)BVBBV037187785 |
discipline | Computer science |
format | Book |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>01452nam a2200361 c 4500</leader><controlfield tag="001">BV037187785</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20211027 </controlfield><controlfield tag="007">t</controlfield><controlfield tag="008">110127s2013 ad|| |||| 00||| eng d</controlfield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9781933988382</subfield><subfield code="c">pbk.</subfield><subfield code="9">978-1-933988-38-2</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">193398838X</subfield><subfield code="9">1-933988-38-X</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)706990107</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)BVBBV037187785</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield><subfield code="e">rakwb</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-19</subfield><subfield code="a">DE-523</subfield><subfield code="a">DE-188</subfield><subfield code="a">DE-739</subfield><subfield code="a">DE-B768</subfield><subfield code="a">DE-210</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 306</subfield><subfield code="0">(DE-625)143654:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 350</subfield><subfield code="0">(DE-625)143667:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Ingersoll, Grant S.</subfield><subfield code="e">Verfasser</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Taming 
text</subfield><subfield code="b">how to find, organize, and manipulate it</subfield><subfield code="c">Grant S. Ingersoll ; Thomas S. Morton ; Andrew L. Farris</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Shelter Island, NY</subfield><subfield code="b">Manning</subfield><subfield code="c">2013</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">XXI, 298 S.</subfield><subfield code="b">Ill., graph. Darst.</subfield><subfield code="c">24 cm</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Text Mining</subfield><subfield code="0">(DE-588)4728093-1</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">Text Mining</subfield><subfield code="0">(DE-588)4728093-1</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" "><subfield code="5">DE-604</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Morton, Thomas S.</subfield><subfield code="e">Verfasser</subfield><subfield code="4">aut</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Farris, Andrew L.</subfield><subfield code="e">Verfasser</subfield><subfield code="4">aut</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">HBZ Datenaustausch</subfield><subfield code="q">application/pdf</subfield><subfield 
code="u">http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=021102273&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="999" ind1=" " ind2=" "><subfield code="a">oai:aleph.bib-bvb.de:BVB01-021102273</subfield></datafield></record></collection> |
id | DE-604.BV037187785 |
illustrated | Illustrated |
indexdate | 2024-07-09T22:52:58Z |
institution | BVB |
isbn | 9781933988382 193398838X |
language | English |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-021102273 |
oclc_num | 706990107 |
open_access_boolean | |
owner | DE-19 DE-BY-UBM DE-523 DE-188 DE-739 DE-B768 DE-210 |
owner_facet | DE-19 DE-BY-UBM DE-523 DE-188 DE-739 DE-B768 DE-210 |
physical | XXI, 298 S. Ill., graph. Darst. 24 cm |
publishDate | 2013 |
publishDateSearch | 2013 |
publishDateSort | 2013 |
publisher | Manning |
record_format | marc |
spelling | Ingersoll, Grant S. Verfasser aut Taming text how to find, organize, and manipulate it Grant S. Ingersoll ; Thomas S. Morton ; Andrew L. Farris Shelter Island, NY Manning 2013 XXI, 298 S. Ill., graph. Darst. 24 cm txt rdacontent n rdamedia nc rdacarrier Text Mining (DE-588)4728093-1 gnd rswk-swf Text Mining (DE-588)4728093-1 s DE-604 Morton, Thomas S. Verfasser aut Farris, Andrew L. Verfasser aut HBZ Datenaustausch application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=021102273&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA Inhaltsverzeichnis |
spellingShingle | Ingersoll, Grant S. Morton, Thomas S. Farris, Andrew L. Taming text how to find, organize, and manipulate it Text Mining (DE-588)4728093-1 gnd |
subject_GND | (DE-588)4728093-1 |
title | Taming text how to find, organize, and manipulate it |
title_auth | Taming text how to find, organize, and manipulate it |
title_exact_search | Taming text how to find, organize, and manipulate it |
title_full | Taming text how to find, organize, and manipulate it Grant S. Ingersoll ; Thomas S. Morton ; Andrew L. Farris |
title_fullStr | Taming text how to find, organize, and manipulate it Grant S. Ingersoll ; Thomas S. Morton ; Andrew L. Farris |
title_full_unstemmed | Taming text how to find, organize, and manipulate it Grant S. Ingersoll ; Thomas S. Morton ; Andrew L. Farris |
title_short | Taming text |
title_sort | taming text how to find organize and manipulate it |
title_sub | how to find, organize, and manipulate it |
topic | Text Mining (DE-588)4728093-1 gnd |
topic_facet | Text Mining |
url | http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=021102273&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
work_keys_str_mv | AT ingersollgrants tamingtexthowtofindorganizeandmanipulateit AT mortonthomass tamingtexthowtofindorganizeandmanipulateit AT farrisandrewl tamingtexthowtofindorganizeandmanipulateit |