Web corpus construction:
Main authors: | Schäfer, Roland ; Bildhauer, Felix |
---|---|
Format: | Book |
Language: | English |
Published: | [San Rafael, Calif.] Morgan & Claypool 2013 |
Series: | Synthesis lectures on human language technologies 22 |
Online access: | Table of contents |
Description: | 129 pp., graphical illustrations |
ISBN: | 9781608459834 |
Internal format
MARC
LEADER | 00000nam a2200000 cb4500 | ||
---|---|---|---|
001 | BV041806916 | ||
003 | DE-604 | ||
005 | 20140516 | ||
007 | t | ||
008 | 140416s2013 d||| |||| 00||| eng d | ||
020 | |a 9781608459834 |c pbk. |9 978-1-60845-983-4 | ||
035 | |a (OCoLC)882442685 | ||
035 | |a (DE-599)BVBBV041806916 | ||
040 | |a DE-604 |b ger |e rakwb | ||
041 | 0 | |a eng | |
049 | |a DE-188 |a DE-19 | ||
084 | |a ES 900 |0 (DE-625)27926: |2 rvk | ||
100 | 1 | |a Schäfer, Roland |e Verfasser |4 aut | |
245 | 1 | 0 | |a Web corpus construction |c Roland Schäfer ; Felix Bildhauer |
264 | 1 | |a [San Rafael, Calif.] |b Morgan & Claypool |c 2013 | |
300 | |a 129 S. |b graph. Darst. | ||
336 | |b txt |2 rdacontent | ||
337 | |b n |2 rdamedia | ||
338 | |b nc |2 rdacarrier | ||
490 | 1 | |a Synthesis lectures on human language technologies |v 22 | |
700 | 1 | |a Bildhauer, Felix |e Verfasser |4 aut | |
830 | 0 | |a Synthesis lectures on human language technologies |v 22 |w (DE-604)BV035447238 |9 22 | |
856 | 4 | 2 | |m HBZ Datenaustausch |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=027252354&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis |
999 | |a oai:aleph.bib-bvb.de:BVB01-027252354 |
Record in the search index
_version_ | 1804152130334883840 |
---|---|
adam_text | Title: Web Corpus Construction
Author: Schäfer, Roland
Year: 2013
Contents
Preface xiii
Acknowledgments xv
1 Web Corpora 1
2 Data Collection 7
2.1 Introduction 7
2.2 The Structure of the Web 8
2.2.1 General Properties 8
2.2.2 Accessibility and Stability of Web pages 9
2.2.3 What's in a (National) Top Level Domain? 11
2.2.4 Problematic Segments of the Web 14
2.3 Crawling Basics 15
2.3.1 Introduction 15
2.3.2 Corpus Construction From Search Engine Results 16
2.3.3 Crawlers and Crawler Performance 19
2.3.4 Configuration Details and Politeness 23
2.3.5 Seed URL Generation 25
2.4 More on Crawling Strategies 28
2.4.1 Introduction 28
2.4.2 Biases and the PageRank 29
2.4.3 Focused Crawling 34
3 Post-Processing 37
3.1 Introduction 37
3.2 Basic Cleanups 38
3.2.1 HTML stripping 38
3.2.2 Character References and Entities 41
3.2.3 Character Sets and Conversion 41
3.2.4 Further Normalization 44
3.3 Boilerplate Removal 48
3.3.1 Introduction to Boilerplate
3.3.2 Feature Extraction 50
3.3.3 Choice of the Machine Learning Method 55
3.4 Language Identification 57
3.5 Duplicate Detection 58
3.5.1 Types of Duplication 58
3.5.2 Perfect Duplicates and Hashing 60
3.5.3 Near Duplicates, Jaccard Coefficients, and Shingling 61
4 Linguistic Processing
4.1 Introduction 65
4.2 Basics of Tokenization, Part-Of-Speech Tagging, and Lemmatization 66
4.2.1 Tokenization 66
4.2.2 Part-Of-Speech Tagging 68
4.2.3 Lemmatization 69
4.3 Linguistic Post-Processing of Noisy Data 70
4.3.1 Introduction 70
4.3.2 Treatment of Noisy Data 71
4.4 Tokenizing Web Texts 72
4.4.1 Example: Missing Whitespace 72
4.4.2 Example: Emoticons 74
4.5 POS Tagging and Lemmatization of Web Texts 75
4.5.1 Tracing Back Errors in POS Tagging 75
4.6 Orthographic Normalization 79
4.7 Software for Linguistic Post-Processing 82
5 Corpus Evaluation and Comparison 85
5.1 Introduction 85
5.2 Rough Quality Check 85
5.2.1 Word and Sentence Lengths 86
5.2.2 Duplication 90
5.3 Measuring Corpus Similarity 92
5.3.1 Inspecting Frequency Lists 93
5.3.2 Hypothesis Testing with χ² 94
5.3.3 Hypothesis Testing with Spearman's Rank Correlation 95
5.3.4 Using Test Statistics without Hypothesis Testing 97
5.4 Comparing Keywords 98
5.4.1 Keyword Extraction with χ² 99
5.4.2 Keyword Extraction Using the Ratio of Relative Frequencies 99
5.4.3 Variants and Refinements 102
5.5 Extrinsic Evaluation 104
5.6 Corpus Composition 106
5.6.1 Estimating Corpus Composition 106
5.6.2 Measuring Corpus Composition 107
5.6.3 Interpreting Corpus Composition 107
5.7 Summary 109
Bibliography 111
Authors' Biographies 129
|
any_adam_object | 1 |
author | Schäfer, Roland Bildhauer, Felix |
author_facet | Schäfer, Roland Bildhauer, Felix |
author_role | aut aut |
author_sort | Schäfer, Roland |
author_variant | r s rs f b fb |
building | Verbundindex |
bvnumber | BV041806916 |
classification_rvk | ES 900 |
ctrlnum | (OCoLC)882442685 (DE-599)BVBBV041806916 |
discipline | Linguistics, Literary Studies |
format | Book |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>01248nam a2200313 cb4500</leader><controlfield tag="001">BV041806916</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20140516 </controlfield><controlfield tag="007">t</controlfield><controlfield tag="008">140416s2013 d||| |||| 00||| eng d</controlfield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9781608459834</subfield><subfield code="c">pbk.</subfield><subfield code="9">978-1-60845-983-4</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)882442685</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)BVBBV041806916</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield><subfield code="e">rakwb</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-188</subfield><subfield code="a">DE-19</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ES 900</subfield><subfield code="0">(DE-625)27926:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Schäfer, Roland</subfield><subfield code="e">Verfasser</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Web corpus construction</subfield><subfield code="c">Roland Schäfer ; Felix Bildhauer</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">[San Rafael, Calif.]</subfield><subfield code="b">Morgan & Claypool</subfield><subfield code="c">2013</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">129 S.</subfield><subfield code="b">graph. 
Darst.</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="490" ind1="1" ind2=" "><subfield code="a">Synthesis lectures on human language technologies</subfield><subfield code="v">22</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Bildhauer, Felix</subfield><subfield code="e">Verfasser</subfield><subfield code="4">aut</subfield></datafield><datafield tag="830" ind1=" " ind2="0"><subfield code="a">Synthesis lectures on human language technologies</subfield><subfield code="v">22</subfield><subfield code="w">(DE-604)BV035447238</subfield><subfield code="9">22</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">HBZ Datenaustausch</subfield><subfield code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=027252354&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="999" ind1=" " ind2=" "><subfield code="a">oai:aleph.bib-bvb.de:BVB01-027252354</subfield></datafield></record></collection> |
id | DE-604.BV041806916 |
illustrated | Illustrated |
indexdate | 2024-07-10T01:05:49Z |
institution | BVB |
isbn | 9781608459834 |
language | English |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-027252354 |
oclc_num | 882442685 |
open_access_boolean | |
owner | DE-188 DE-19 DE-BY-UBM |
owner_facet | DE-188 DE-19 DE-BY-UBM |
physical | 129 S. graph. Darst. |
publishDate | 2013 |
publishDateSearch | 2013 |
publishDateSort | 2013 |
publisher | Morgan & Claypool |
record_format | marc |
series | Synthesis lectures on human language technologies |
series2 | Synthesis lectures on human language technologies |
spelling | Schäfer, Roland Verfasser aut Web corpus construction Roland Schäfer ; Felix Bildhauer [San Rafael, Calif.] Morgan & Claypool 2013 129 S. graph. Darst. txt rdacontent n rdamedia nc rdacarrier Synthesis lectures on human language technologies 22 Bildhauer, Felix Verfasser aut Synthesis lectures on human language technologies 22 (DE-604)BV035447238 22 HBZ Datenaustausch application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=027252354&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA Inhaltsverzeichnis |
spellingShingle | Schäfer, Roland Bildhauer, Felix Web corpus construction Synthesis lectures on human language technologies |
title | Web corpus construction |
title_auth | Web corpus construction |
title_exact_search | Web corpus construction |
title_full | Web corpus construction Roland Schäfer ; Felix Bildhauer |
title_fullStr | Web corpus construction Roland Schäfer ; Felix Bildhauer |
title_full_unstemmed | Web corpus construction Roland Schäfer ; Felix Bildhauer |
title_short | Web corpus construction |
title_sort | web corpus construction |
url | http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=027252354&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
volume_link | (DE-604)BV035447238 |
work_keys_str_mv | AT schaferroland webcorpusconstruction AT bildhauerfelix webcorpusconstruction |