Text mining: classification, clustering, and applications
Saved in:
Format: | Book |
---|---|
Language: | English |
Published: |
Boca Raton [et al.]
CRC Press
2009
|
Series: | Data mining and knowledge discovery series
|
Subjects: | |
Online access: | Table of contents |
Description: | Includes bibliographical references and index |
Description: | XXX, 290 p., ill., graphs |
ISBN: | 9781420059403 |
Internal format
MARC
LEADER | 00000nam a2200000 c 4500 | ||
---|---|---|---|
001 | BV035605094 | ||
003 | DE-604 | ||
005 | 20160204 | ||
007 | t | ||
008 | 090707s2009 a||| |||| 00||| eng d | ||
010 | |a 2009013047 | ||
020 | |a 9781420059403 |c hardcover |9 978-1-4200-5940-3 | ||
035 | |a (OCoLC)144226505 | ||
035 | |a (DE-599)GBV595954707 | ||
040 | |a DE-604 |b ger |e aacr | ||
041 | 0 | |a eng | |
049 | |a DE-473 |a DE-703 |a DE-20 |a DE-355 |a DE-M382 | ||
050 | 0 | |a QA76.9.D343 | |
082 | 0 | |a 006.3/12 |2 22 | |
084 | |a ST 302 |0 (DE-625)143652: |2 rvk | ||
084 | |a ST 306 |0 (DE-625)143654: |2 rvk | ||
084 | |a ST 530 |0 (DE-625)143679: |2 rvk | ||
245 | 1 | 0 | |a Text mining |b classification, clustering, and applications |c ed. by Ashok Srivastava ... |
264 | 1 | |a Boca Raton [u.a.] |b CRC Press |c 2009 | |
300 | |a XXX, 290 S. |b Ill., graf. Darst. | ||
336 | |b txt |2 rdacontent | ||
337 | |b n |2 rdamedia | ||
338 | |b nc |2 rdacarrier | ||
490 | 0 | |a Data mining and knowledge discovery series | |
500 | |a Includes bibliographical references and index | ||
650 | 0 | |a Data mining / Statistical methods | |
650 | 4 | |a Data mining |x Statistical methods | |
650 | 0 | 7 | |a Text Mining |0 (DE-588)4728093-1 |2 gnd |9 rswk-swf |
655 | 7 | |0 (DE-588)4143413-4 |a Aufsatzsammlung |2 gnd-content | |
689 | 0 | 0 | |a Text Mining |0 (DE-588)4728093-1 |D s |
689 | 0 | |5 DE-604 | |
700 | 1 | |a Srivastava, Ashok K. |e Sonstige |0 (DE-588)120408945 |4 oth | |
856 | 4 | 2 | |m Digitalisierung UB Bamberg |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=017660340&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis |
999 | |a oai:aleph.bib-bvb.de:BVB01-017660340 |
Record in the search index
_version_ | 1804139273897639936 |
---|---|
adam_text | Contents

List of Figures xiii
List of Tables xix
Introduction xxi
About the Editors xxvii
Contributor List xxix

1 Analysis of Text Patterns Using Kernel Methods 1
Marco Turchi, Alessio Mammone, and Nello Cristianini
1.1 Introduction 1
1.2 General Overview on Kernel Methods 1
1.2.1 Finding Patterns in Feature Space 5
1.2.2 Formal Properties of Kernel Functions 8
1.2.3 Operations on Kernel Functions 10
1.3 Kernels for Text 11
1.3.1 Vector Space Model 11
1.3.2 Semantic Kernels 13
1.3.3 String Kernels 17
1.4 Example 19
1.5 Conclusion and Further Reading 22

2 Detection of Bias in Media Outlets with Statistical Learning Methods 27
Blaz Fortuna, Carolina Galleguillos, and Nello Cristianini
2.1 Introduction 27
2.2 Overview of the Experiments 29
2.3 Data Collection and Preparation 30
2.3.1 Article Extraction from HTML Pages 31
2.3.2 Data Preparation 31
2.3.3 Detection of Matching News Items 32
2.4 News Outlet Identification 35
2.5 Topic-Wise Comparison of Term Bias 38
2.6 News Outlets Map 40
2.6.1 Distance Based on Lexical Choices 42
2.6.2 Distance Based on Choice of Topics 43
2.7 Related Work 44
2.8 Conclusion 45
2.9 Appendix A: Support Vector Machines 48
2.10 Appendix B: Bag of Words and Vector Space Models 48
2.11 Appendix C: Kernel Canonical Correlation Analysis 49
2.12 Appendix D: Multidimensional Scaling 50

3 Collective Classification for Text Classification 51
Galileo Namata, Prithviraj Sen, Mustafa Bilgic, and Lise Getoor
3.1 Introduction 51
3.2 Collective Classification: Notation and Problem Definition 53
3.3 Approximate Inference Algorithms for Approaches Based on Local Conditional Classifiers 53
3.3.1 Iterative Classification 54
3.3.2 Gibbs Sampling 55
3.3.3 Local Classifiers and Further Optimizations 55
3.4 Approximate Inference Algorithms for Approaches Based on Global Formulations 56
3.4.1 Loopy Belief Propagation 58
3.4.2 Relaxation Labeling via Mean-Field Approach 59
3.5 Learning the Classifiers 60
3.6 Experimental Comparison 60
3.6.1 Features Used 60
3.6.2 Real-World Datasets 60
3.6.3 Practical Issues 63
3.7 Related Work 64
3.8 Conclusion 66
3.9 Acknowledgments 66

4 Topic Models 71
David M. Blei and John D. Lafferty
4.1 Introduction 71
4.2 Latent Dirichlet Allocation 72
4.2.1 Statistical Assumptions 73
4.2.2 Exploring a Corpus with the Posterior Distribution 75
4.3 Posterior Inference for LDA 76
4.3.1 Mean Field Variational Inference 78
4.3.2 Practical Considerations 81
4.4 Dynamic Topic Models and Correlated Topic Models 82
4.4.1 The Correlated Topic Model 82
4.4.2 The Dynamic Topic Model 84
4.5 Discussion 89

5 Nonnegative Matrix and Tensor Factorization for Discussion Tracking 95
Brett W. Bader, Michael W. Berry, and Amy N. Langville
5.1 Introduction 95
5.1.1 Extracting Discussions 96
5.1.2 Related Work 96
5.2 Notation 97
5.3 Tensor Decompositions and Algorithms 98
5.3.1 PARAFAC-ALS 100
5.3.2 Nonnegative Tensor Factorization 100
5.4 Enron Subset 102
5.4.1 Term Weighting Techniques 103
5.5 Observations and Results 105
5.5.1 Nonnegative Tensor Decomposition 105
5.5.2 Analysis of Three-Way Tensor 106
5.5.3 Analysis of Four-Way Tensor 108
5.6 Visualizing Results of the NMF Clustering 111
5.7 Future Work 116

6 Text Clustering with Mixture of von Mises-Fisher Distributions 121
Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, and Suvrit Sra
6.1 Introduction 121
6.2 Related Work 123
6.3 Preliminaries 124
6.3.1 The von Mises-Fisher (vMF) Distribution 124
6.3.2 Maximum Likelihood Estimates 125
6.4 EM on a Mixture of vMFs (moVMF) 126
6.5 Handling High-Dimensional Text Datasets 127
6.5.1 Approximating κ 128
6.5.2 Experimental Study of the Approximation 130
6.6 Algorithms 132
6.7 Experimental Results 134
6.7.1 Datasets 135
6.7.2 Methodology 138
6.7.3 Simulated Datasets 138
6.7.4 Classic3 Family of Datasets 140
6.7.5 Yahoo News Dataset 143
6.7.6 20 Newsgroup Family of Datasets 143
6.7.7 Slashdot Datasets 145
6.8 Discussion 146
6.9 Conclusions and Future Work 148

7 Constrained Partitional Clustering of Text Data: An Overview 155
Sugato Basu and Ian Davidson
7.1 Introduction 155
7.2 Uses of Constraints 157
7.2.1 Constraint-Based Methods 157
7.2.2 Distance-Based Methods 158
7.3 Text Clustering 159
7.3.1 Pre-Processing 161
7.3.2 Distance Measures 162
7.4 Partitional Clustering with Constraints 163
7.4.1 COP-KMeans 163
7.4.2 Algorithms with Penalties - PKM, CVQE 164
7.4.3 LCVQE: An Extension to CVQE 167
7.4.4 Probabilistic Penalty - PKM 167
7.5 Learning Distance Function with Constraints 168
7.5.1 Generalized Mahalanobis Distance Learning 168
7.5.2 Kernel Distance Functions Using AdaBoost 169
7.6 Satisfying Constraints and Learning Distance Functions 170
7.6.1 Hidden Markov Random Field (HMRF) Model 170
7.6.2 EM Algorithm 173
7.6.3 Improvements to HMRF-KMeans 173
7.7 Experiments 174
7.7.1 Datasets 174
7.7.2 Clustering Evaluation 175
7.7.3 Methodology 176
7.7.4 Comparison of Distance Functions 176
7.7.5 Experimental Results 177
7.8 Conclusions 180

8 Adaptive Information Filtering 185
Yi Zhang
8.1 Introduction 185
8.2 Standard Evaluation Measures 188
8.3 Standard Retrieval Models and Filtering Approaches 190
8.3.1 Existing Retrieval Models 190
8.3.2 Existing Adaptive Filtering Approaches 192
8.4 Collaborative Adaptive Filtering 194
8.5 Novelty and Redundancy Detection 196
8.5.1 Set Difference 199
8.5.2 Geometric Distance 199
8.5.3 Distributional Similarity 200
8.5.4 Summary of Novelty Detection 201
8.6 Other Adaptive Filtering Topics 201
8.6.1 Beyond Bag of Words 202
8.6.2 Using Implicit Feedback 202
8.6.3 Exploration and Exploitation Trade Off 203
8.6.4 Evaluation beyond Topical Relevance 203
8.7 Acknowledgments 204

9 Utility-Based Information Distillation 213
Yiming Yang and Abhimanyu Lad
9.1 Introduction 213
9.1.1 Related Work in Adaptive Filtering (AF) 213
9.1.2 Related Work in Topic Detection and Tracking (TDT) 214
9.1.3 Limitations of Current Solutions 215
9.2 A Sample Task 216
9.3 Technical Cores 218
9.3.1 Adaptive Filtering Component 218
9.3.2 Passage Retrieval Component 219
9.3.3 Novelty Detection Component 220
9.3.4 Anti-Redundant Ranking Component 220
9.4 Evaluation Methodology 221
9.4.1 Answer Keys 221
9.4.2 Evaluating the Utility of a Sequence of Ranked Lists 223
9.5 Data 225
9.6 Experiments and Results 226
9.6.1 Baselines 226
9.6.2 Experimental Setup 226
9.6.3 Results 227
9.7 Concluding Remarks 229
9.8 Acknowledgments 229

10 Text Search-Enhanced with Types and Entities 233
Soumen Chakrabarti, Sujatha Das, Vijay Krishnan, and Kriti Puniyani
10.1 Entity-Aware Search Architecture 233
10.1.1 Guessing Answer Types 234
10.1.2 Scoring Snippets 235
10.1.3 Efficient Indexing and Query Processing 236
10.1.4 Comparison with Prior Work 236
10.2 Understanding the Question 236
10.2.1 Answer Type Clues in Questions 239
10.2.2 Sequential Labeling of Type Clue Spans 240
10.2.3 From Type Clue Spans to Answer Types 245
10.2.4 Experiments 247
10.3 Scoring Potential Answer Snippets 251
10.3.1 A Proximity Model 253
10.3.2 Learning the Proximity Scoring Function 255
10.3.3 Experiments 257
10.4 Indexing and Query Processing 260
10.4.1 Probability of a Query Atype 262
10.4.2 Pre-Generalize and Post-Filter 262
10.4.3 Atype Subset Index Space Model 265
10.4.4 Query Time Bloat Model 266
10.4.5 Choosing an Atype Subset 269
10.4.6 Experiments 271
10.5 Conclusion 272
10.5.1 Summary 272
10.5.2 Ongoing and Future Work 273

Index 279
List of Figures

1.1 Modularity of kernel-based algorithms: the data are transformed into a kernel matrix by using a kernel function; then the pattern analysis algorithm uses this information to find interesting relations, which are all written in the form of a linear combination of kernel functions 3
1.2 The evolutionary rooted tree built using a 4-spectrum kernel and the Neighbor Joining algorithm 20
1.3 Multi-dimensional scaling using a 4-spectrum kernel distance matrix 21
2.1 Number of discovered pairs vs. time window size 34
2.2 Distribution of BEP for 300 random sets 38
2.3 Relative distance between news outlets using the BEP metric 43
2.4 Relative distance between news outlets, using the Topic similarity 44
3.1 A small text classification problem. Each box denotes a document, each directed edge between a pair of boxes denotes a hyperlink, and each oval node denotes a random variable. Assume the smaller oval nodes within each box represent the presence of the words w1, w2, and w3 in the document, and the larger oval nodes the label of the document, where the set of label values is L = {L1, L2}. A shaded oval denotes an observed variable whereas an unshaded oval node denotes an unobserved variable whose value needs to be predicted 52
4.1 Five topics from a 50-topic LDA model fit to Science from 1980-2002 72
4.2 A graphical model representation of the latent Dirichlet allocation (LDA). Nodes denote random variables; edges denote dependence between random variables. Shaded nodes denote observed random variables; unshaded nodes denote hidden random variables. The rectangular boxes are plate notation, which denote replication 74
4.3 Five topics from a 50-topic model fit to the Yale Law Journal from 1980-2003 75
4.4 (See color insert.) The analysis of a document from Science. Document similarity was computed using Eq. (4.4); topic words were computed using Eq. (4.3) 77
4.5 One iteration of mean field variational inference for LDA. This algorithm is repeated until the objective function in Eq. (4.6) converges 80
4.6 The graphical model for the correlated topic model in Section 4.4.1 84
4.7 A portion of the topic graph learned from the 16,351 OCR articles from Science (1990-1999). Each topic node is labeled with its five most probable phrases and has font proportional to its popularity in the corpus. (Phrases are found by permutation test.) The full model can be browsed with pointers to the original articles at http://www.cs.cmu.edu/~lemur/science/ and on STATLIB. (The algorithm for constructing this graph from the covariance matrix of the logistic normal is given in (9).) 85
4.8 A graphical model representation of a dynamic topic model (for three time slices). Each topic's parameters βt,k evolve over time 86
4.9 Two topics from a dynamic topic model fit to the Science archive (1880-2002) 88
4.10 The top ten most similar articles to the query in Science (1880-2002), scored by Eq. (4.4) using the posterior distribution from the dynamic topic model 89
5.1 PARAFAC provides a three-way decomposition with some similarity to the singular value decomposition 99
5.2 (See color insert.) Five discussion topics identified in the three-way analysis over months 106
5.3 Three discussion topics identified in the three-way analysis over days 108
5.4 Weekly betting pool identified in the three-way (top) and four-way (bottom) analyses 109
5.5 Long running discussion on FERC's various rulings of RTOs 110
5.6 Forwarding of Texas A&M school fight song 111
5.7 (See color insert.) Pixel plot of the raw Enron term-by-email matrix 112
5.8 (See color insert.) Pixel plot of the reordered Enron term-by-email matrix 113
5.9 (See color insert.) Pixel plot of the reordered Enron term-by-document matrix with term and document labels 114
5.10 (See color insert.) Close-up of one section of pixel plot of the reordered Enron term-by-document matrix 115
6.1 True and approximated κ values with d = 1000 130
6.2 Comparison of approximations for varying d, κ = 500 131
6.3 Comparison of approximations for varying κ (with d = 1000) 132
6.4 (See color insert.) Small-mix dataset and its clustering by soft-moVMF 139
6.5 Comparison of the algorithms for the Classic3 datasets and the Yahoo News dataset 142
6.6 Comparison of the algorithms for the 20 Newsgroup and some subsets 144
6.7 Comparison of the algorithms for more subsets of 20 Newsgroup data 145
6.8 (See color insert.) Variation of entropy of hidden variables with number of iterations (soft-moVMF) 148
7.1 Input instances and constraints 158
7.2 Constraint-based clustering 159
7.3 Input instances and constraints 160
7.4 Distance-based clustering 160
7.5 Clustering using KMeans 164
7.6 Clustering under constraints using COP-KMeans 165
7.7 DistBoost algorithm 169
7.8 A hidden Markov random field 171
7.9 Graphical plate model of variable dependence 171
7.10 HMRF-KMeans algorithm 174
7.11 Comparison of cosine and Euclidean distance 178
7.12 Results on News-Different-3 178
7.13 Results on News-Related-3 179
7.14 Results on News-Similar-3 179
8.1 A typical filtering system. A filtering system can serve many users, although only one user is shown in the figure. Information can be documents, images, or videos. Without loss of generality, we focus on text documents in this chapter 186
8.2 Illustration of dependencies of variables in the hierarchical model. The rating, y, for a document, x, is conditioned on the document and the user model, wm, associated with the user m. Users share information about their models through the prior, Φ = (μ, Σ) 195
9.1 PNDCU scores of Indri and CAFE for two dampening factors (p), and various settings (PRF: Pseudo-Relevance Feedback, F: Feedback, N: Novelty Detection, A: Anti-Redundant Ranking) 227
9.2 Performance of CAFE and Indri across chunks 228
10.1 (See color insert.) Document as a linear sequence of tokens, some connected to a type hierarchy. Some sample queries and their approximate translation to a semi-structured form are shown 235
10.2 (See color insert.) The IR4QA system that we describe in this paper 237
10.3 Summary of % accuracy for UIUC data. (1) SNoW accuracy without the related word dictionary was not reported; with the related-word dictionary, it achieved 91%. (2) SNoW with a related-word dictionary achieved 84.2% but the other algorithms did not use it. Our results are summarized in the last two rows; see text for details 240
10.4 2- and 3-state transition models 241
10.5 Stanford Parser output example 242
10.6 A multi-resolution tabular view of the question parse showing tag and num attributes in each cell; "capital city" is the informer span with y = 1 242
10.7 The meta-learning approach 245
10.8 Effect of feature choices 248
10.9 A significant boost in question classification accuracy is seen when two levels of non-local features are provided to the SVM, compared to just the POS features at the leaf of the parse tree 249
10.10 Effect of number of CRF states, and comparison with the heuristic baseline (Jaccard accuracy expressed as %) 250
10.11 Percent accuracy with linear SVMs, perfect informer spans and various feature encodings. The Coarse column is for the 6 top-level UIUC classes and the Fine column is for the 50 second-level classes 251
10.12 Summary of % accuracy broken down by broad syntactic question types. a: question bigrams, b: perfect informers only, c: heuristic informers only, d: CRF informers only, e-g: bigrams plus perfect, heuristic and CRF informers 252
10.13 (See color insert.) Setting up the proximity scoring problem 254
10.14 Relative CPU times needed by RankSVM and RankExp as a function of the number of ordering constraints 258
10.15 βj shows a noisy unimodal pattern 259
10.16 End-to-end accuracy using RankExp β is significantly better than IR-style ranking. Train and test years are from 1999, 2000, 2001. R300 is recall at k = 300 out of 261 test questions. C = 0.1, C = 1 and C = 10 gave almost identical results 259
10.17 Relative sizes of the corpus and various indexes for TREC 2000 261
10.18 Highly skewed atype frequencies in TREC query logs 261
10.19 Log likelihood of validation data against the Lidstone smoothing parameter ℓ 263
10.20 Pre-generalization and post-filtering 263
10.21 Sizes of the additional indices needed for pre-generalize and post-filter query processing, compared to the usual indices for TREC 2000 265
10.22 Σ_{a∈R} corpusCount(a) is a very good predictor of the size of the atype subset index. (Root atypes are not indexed.) 266
10.23 t_scan is sufficiently concentrated that replacing the distribution by a constant number is not grossly inaccurate 267
10.24 Like t_scan, t_forward is concentrated and can be reasonably replaced by a point estimate 268
10.25 Scatter of observed against estimated query bloat 269
10.26 Histogram of observed-to-estimated bloat ratio for individual queries with a specific R occupying an estimated 145 MB of atype index 269
10.27 The inputs are atype set A and workload W. The output is a series of trade-offs between index size of R and average query bloat over W 270
10.28 (See color insert.) Estimated space-time tradeoffs produced by AtypeSubsetChooser. The y-axis uses a log scale. Note that the curve for ℓ = 10^-3 (suggested by Figure 10.19) has the lowest average bloat 272
10.29 Estimated bloat for various values of ℓ for a specific estimated index size of 145 MB. The y-axis uses a log scale 273
10.30 Estimated and observed space-time tradeoffs produced by AtypeSubsetChooser 274
10.31 Average time per query (with and without generalization) for various estimated index sizes 275
List of Tables

2.1 Number of news items collected from different outlets 31
2.2 Number of discovered news pairs 33
2.3 Results for outlet identification of a news item 36
2.4 Results for news outlet identification of a news item from the set of news item pairs 37
2.5 Main topics covered by CNN or Al Jazeera 40
2.6 Number of discovered pairs 41
2.7 Conditional probabilities of a story 41
2.8 Number of news articles covered by all four news outlets 42
2.9 BEP metric distances 43
3.1 Accuracy results for WebKB. CC algorithms outperformed their CO counterparts significantly, and LR versions outperformed NB versions significantly. The differences between ICA-NB and GS-NB, and the differences between ICA-LR and GS-LR, are not statistically significant. Both LBP and MF outperformed ICA-LR and GS-LR significantly 62
3.2 Accuracy results for the Cora dataset. CC algorithms outperformed their CO counterparts significantly. LR versions significantly outperformed NB versions. ICA-NB outperformed GS-NB for SS and M; the other differences between ICA and GS were not significant (both NB and LR versions). Even though MF outperformed ICA-LR, GS-LR, and LBP, the differences were not statistically significant 63
3.3 Accuracy results for the CiteSeer dataset. CC algorithms significantly outperformed their CO counterparts except for ICA-NB and GS-NB for matched cross-validation. CO and CC algorithms based on LR outperformed the NB versions, but the differences were not significant. ICA-NB outperformed GS-NB significantly for SS, but the rest of the differences between LR versions of ICA and GS, LBP and MF were not significant 64
5.1 Eleven of the 197 email authors represented in the term-author-time array X 103
6.1 Approximations κ̂ for a sampling of κ and d values 129
6.2 True and estimated parameters for small-mix 139
6.3 Performance of soft-moVMF on big-mix dataset 140
6.4 Comparative confusion matrices for 3 clusters of Classic3 (rows represent clusters) 140
6.5 Comparative confusion matrices for 3 clusters of Classic300 141
6.6 Comparative confusion matrices for 3 clusters of Classic400 141
6.7 Comparative confusion matrices for 5 clusters of Classic3 141
6.8 Performance comparison of algorithms averaged over 5 runs 145
6.9 Five of the topics obtained by running batch vMF on slash-7 146
7.1 Text datasets used in experimental evaluation 175
8.1 The values assigned to relevant and non-relevant documents that the filtering system did and did not deliver. R+, R-, N+, and N- correspond to the number of documents that fall into the corresponding category. A_R, A_N, B_R, and B_N correspond to the credit/penalty for each element in the category 188
|
any_adam_object | 1 |
author_GND | (DE-588)120408945 |
building | Verbundindex |
bvnumber | BV035605094 |
callnumber-first | Q - Science |
callnumber-label | QA76 |
callnumber-raw | QA76.9.D343 |
callnumber-search | QA76.9.D343 |
callnumber-sort | QA 276.9 D343 |
callnumber-subject | QA - Mathematics |
classification_rvk | ST 302 ST 306 ST 530 |
ctrlnum | (OCoLC)144226505 (DE-599)GBV595954707 |
dewey-full | 006.3/12 |
dewey-hundreds | 000 - Computer science, information, general works |
dewey-ones | 006 - Special computer methods |
dewey-raw | 006.3/12 |
dewey-search | 006.3/12 |
dewey-sort | 16.3 212 |
dewey-tens | 000 - Computer science, information, general works |
discipline | Informatik |
format | Book |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>01712nam a2200433 c 4500</leader><controlfield tag="001">BV035605094</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20160204 </controlfield><controlfield tag="007">t</controlfield><controlfield tag="008">090707s2009 a||| |||| 00||| eng d</controlfield><datafield tag="010" ind1=" " ind2=" "><subfield code="a">2009013047</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9781420059403</subfield><subfield code="c">hardcover</subfield><subfield code="9">978-1-4200-5940-3</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)144226505</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)GBV595954707</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield><subfield code="e">aacr</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-473</subfield><subfield code="a">DE-703</subfield><subfield code="a">DE-20</subfield><subfield code="a">DE-355</subfield><subfield code="a">DE-M382</subfield></datafield><datafield tag="050" ind1=" " ind2="0"><subfield code="a">QA76.9.D343</subfield></datafield><datafield tag="082" ind1="0" ind2=" "><subfield code="a">006.3/12</subfield><subfield code="2">22</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 302</subfield><subfield code="0">(DE-625)143652:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 306</subfield><subfield code="0">(DE-625)143654:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 530</subfield><subfield 
code="0">(DE-625)143679:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Text mining</subfield><subfield code="b">classification, clustering, and applications</subfield><subfield code="c">ed. by Ashok Srivastava ...</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Boca Raton [u.a.]</subfield><subfield code="b">CRC Press</subfield><subfield code="c">2009</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">XXX, 290 S.</subfield><subfield code="b">Ill., graf. Darst.</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="490" ind1="0" ind2=" "><subfield code="a">Data mining and knowledge discovery series</subfield></datafield><datafield tag="500" ind1=" " ind2=" "><subfield code="a">Includes bibliographical references and index</subfield></datafield><datafield tag="650" ind1=" " ind2="0"><subfield code="a">Data mining / Statistical methods</subfield></datafield><datafield tag="650" ind1=" " ind2="4"><subfield code="a">Data mining</subfield><subfield code="x">Statistical methods</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Text Mining</subfield><subfield code="0">(DE-588)4728093-1</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="655" ind1=" " ind2="7"><subfield code="0">(DE-588)4143413-4</subfield><subfield code="a">Aufsatzsammlung</subfield><subfield code="2">gnd-content</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">Text Mining</subfield><subfield 
code="0">(DE-588)4728093-1</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" "><subfield code="5">DE-604</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Srivastava, Ashok K.</subfield><subfield code="e">Sonstige</subfield><subfield code="0">(DE-588)120408945</subfield><subfield code="4">oth</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">Digitalisierung UB Bamberg</subfield><subfield code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=017660340&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="999" ind1=" " ind2=" "><subfield code="a">oai:aleph.bib-bvb.de:BVB01-017660340</subfield></datafield></record></collection> |
genre | (DE-588)4143413-4 Aufsatzsammlung gnd-content |
genre_facet | Aufsatzsammlung |
id | DE-604.BV035605094 |
illustrated | Illustrated |
indexdate | 2024-07-09T21:41:28Z |
institution | BVB |
isbn | 9781420059403 |
language | English |
lccn | 2009013047 |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-017660340 |
oclc_num | 144226505 |
open_access_boolean | |
owner | DE-473 DE-BY-UBG DE-703 DE-20 DE-355 DE-BY-UBR DE-M382 |
owner_facet | DE-473 DE-BY-UBG DE-703 DE-20 DE-355 DE-BY-UBR DE-M382 |
physical | XXX, 290 S. Ill., graf. Darst. |
publishDate | 2009 |
publishDateSearch | 2009 |
publishDateSort | 2009 |
publisher | CRC Press |
record_format | marc |
series2 | Data mining and knowledge discovery series |
spelling | Text mining classification, clustering, and applications ed. by Ashok Srivastava ... Boca Raton [u.a.] CRC Press 2009 XXX, 290 S. Ill., graf. Darst. txt rdacontent n rdamedia nc rdacarrier Data mining and knowledge discovery series Includes bibliographical references and index Data mining / Statistical methods Data mining Statistical methods Text Mining (DE-588)4728093-1 gnd rswk-swf (DE-588)4143413-4 Aufsatzsammlung gnd-content Text Mining (DE-588)4728093-1 s DE-604 Srivastava, Ashok K. Sonstige (DE-588)120408945 oth Digitalisierung UB Bamberg application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=017660340&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA Inhaltsverzeichnis |
spellingShingle | Text mining classification, clustering, and applications Data mining / Statistical methods Data mining Statistical methods Text Mining (DE-588)4728093-1 gnd |
subject_GND | (DE-588)4728093-1 (DE-588)4143413-4 |
title | Text mining classification, clustering, and applications |
title_auth | Text mining classification, clustering, and applications |
title_exact_search | Text mining classification, clustering, and applications |
title_full | Text mining classification, clustering, and applications ed. by Ashok Srivastava ... |
title_fullStr | Text mining classification, clustering, and applications ed. by Ashok Srivastava ... |
title_full_unstemmed | Text mining classification, clustering, and applications ed. by Ashok Srivastava ... |
title_short | Text mining |
title_sort | text mining classification clustering and applications |
title_sub | classification, clustering, and applications |
topic | Data mining / Statistical methods Data mining Statistical methods Text Mining (DE-588)4728093-1 gnd |
topic_facet | Data mining / Statistical methods Data mining Statistical methods Text Mining Aufsatzsammlung |
url | http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=017660340&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
work_keys_str_mv | AT srivastavaashokk textminingclassificationclusteringandapplications |