Natural language processing for historical texts:
Main author: | Piotrowski, Michael, 1972- |
---|---|
Format: | Book |
Language: | English |
Published: | [San Rafael, Calif.]: Morgan & Claypool, 2012 |
Series: | Synthesis lectures on human language technologies, 17 |
Subjects: | Sprachverarbeitung (language processing); Natürliche Sprache (natural language); Computerlinguistik (computational linguistics) |
Online access: | Table of contents; Blurb |
Description: | XI, 145 pp., ill., diagrams |
ISBN: | 1608459462 9781608459469 |
Internal format
MARC
LEADER | 00000nam a2200000 cb4500 | ||
---|---|---|---|
001 | BV040705633 | ||
003 | DE-604 | ||
005 | 20180219 | ||
007 | t | ||
008 | 130129s2012 ad|| |||| 00||| eng d | ||
020 | |a 1608459462 |c paperback |9 1-60845-946-2 | ||
020 | |a 9781608459469 |c paperback |9 978-1-60845-946-9 | ||
035 | |a (OCoLC)828790373 | ||
035 | |a (DE-599)BSZ373304587 | ||
040 | |a DE-604 |b ger | ||
041 | 0 | |a eng | |
049 | |a DE-19 |a DE-188 |a DE-12 |a DE-83 |a DE-355 |a DE-739 | ||
084 | |a ST 306 |0 (DE-625)143654: |2 rvk | ||
084 | |a 24,1 |2 ssgn | ||
100 | 1 | |a Piotrowski, Michael |d 1972- |e Verfasser |0 (DE-588)139045368 |4 aut | |
245 | 1 | 0 | |a Natural language processing for historical texts |c Michael Piotrowski |
264 | 1 | |a [San Rafael, Calif.] |b Morgan & Claypool |c 2012 | |
300 | |a XI, 145 S. |b Ill., graph. Darst. | ||
336 | |b txt |2 rdacontent | ||
337 | |b n |2 rdamedia | ||
338 | |b nc |2 rdacarrier | ||
490 | 1 | |a Synthesis lectures on human language technologies |v 17 | |
650 | 0 | 7 | |a Sprachverarbeitung |0 (DE-588)4116579-2 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Natürliche Sprache |0 (DE-588)4041354-8 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Computerlinguistik |0 (DE-588)4035843-4 |2 gnd |9 rswk-swf |
689 | 0 | 0 | |a Sprachverarbeitung |0 (DE-588)4116579-2 |D s |
689 | 0 | 1 | |a Natürliche Sprache |0 (DE-588)4041354-8 |D s |
689 | 0 | 2 | |a Computerlinguistik |0 (DE-588)4035843-4 |D s |
689 | 0 | |5 DE-604 | |
776 | 0 | 8 | |i Erscheint auch als |n Online-Ausgabe |z 978-1-60845-947-6 |
830 | 0 | |a Synthesis lectures on human language technologies |v 17 |w (DE-604)BV035447238 |9 17 | |
856 | 4 | 2 | |m Digitalisierung BSB Muenchen 21 - ADAM Catalogue Enrichment |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=025686081&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis |
856 | 4 | 2 | |m Digitalisierung UB Regensburg - ADAM Catalogue Enrichment |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=025686081&sequence=000004&line_number=0002&func_code=DB_RECORDS&service_type=MEDIA |3 Klappentext |
999 | |a oai:aleph.bib-bvb.de:BVB01-025686081 |
Record in the search index
_version_ | 1804150010246332416 |
---|---|
adam_text | Contents
Acknowledgments
1 Introduction
1.1 Historical Languages and Modern Languages
1.2 Intended Audience
1.3 Outline
2 NLP and Digital Humanities
2.1 Origins of Digital Humanities
2.2 Convergence of NLP and Digital Humanities
2.3 Summary
3 Spelling in Historical Texts
3.1 The Role of Orthography in NLP
3.2 Spelling and Historical Texts
3.2.1 Difference: Diachronic Spelling Variation
3.2.2 Variance: Synchronic Spelling Variation
3.2.3 Uncertainty
3.3 Summary
4 Acquiring Historical Texts
4.1 Digitization of Historical Texts
4.2 Scanning
4.3 Optical Character Recognition
4.3.1 Using More Than One OCR System
4.3.2 Lexical Resources for Historical OCR
4.3.3 Collaborative Correction of OCR Output
4.4 Manual Text Entry
4.5 Computer-Assisted Transcription
4.6 Summary
5 Text Encoding and Annotation Schemes
5.1 Unicode for Historical Text
5.2 TEI for Historical Texts
5.3 Summary
6 Handling Spelling Variation
6.1 Spelling Canonicalization
6.2 Edit Distance
6.3 Approaches for Handling Spelling Variation in Historical Texts
6.3.1 Absolute Approaches to Canonicalization
6.3.2 Relative Approaches to Canonicalization
6.4 Detecting and Correcting OCR Errors
6.4.1 Text-Induced Corpus Clean-up
6.4.2 Anomalous Text Detection
6.5 Limits of Spelling Canonicalization
6.6 Summary
7 NLP Tools for Historical Languages
7.1 Part-of-Speech Tagging
7.1.1 Creating a POS Tagger from Scratch
7.1.2 Using a Modern-Language POS Tagger
7.2 Lemmatization and Morphological Analysis
7.3 Syntactic Parsing
7.4 Summary
8 Historical Corpora
8.1 Arabic
8.2 Chinese
8.3 Dutch
8.4 English
8.5 French
8.6 German
8.7 Nordic Languages
8.8 Latin and Ancient Greek
8.9 Portuguese
8.10 Summary
Conclusion
Bibliography
Author's Biography
Series Editor: Graeme Hirst, University of Toronto
Natural Language Processing for Historical Texts
Michael Piotrowski, Leibniz Institute of European History, Germany
More and more historical texts are becoming available in digital form. Digitization of paper documents is
motivated by the aim of preserving cultural heritage and making it more accessible, both to laypeople and
scholars. As digital images cannot be searched for text, digitization projects increasingly strive to create digital
text, which can be searched and otherwise automatically processed, in addition to facsimiles. Indeed, the
emerging field of digital humanities heavily relies on the availability of digital text for its studies.
Together with the increasing availability of historical texts in digital form, there is a growing interest in
applying natural language processing (NLP) methods and tools to historical texts. However, the specific
linguistic properties of historical texts—the lack of standardized orthography in particular—pose special
challenges for NLP.
This book aims to give an introduction to NLP for historical texts and an overview of the state of the art
in this field. The book starts with an overview of methods for the acquisition of historical texts (scanning and
OCR), discusses text encoding and annotation schemes, and presents examples of corpora of historical texts
in a variety of languages. The book then discusses specific methods, such as creating part-of-speech taggers
for historical languages or handling spelling variation. A final chapter analyzes the relationship between NLP
and the digital humanities.
Certain recently emerging textual genres, such as SMS, social media, and chat messages, or newsgroup and
forum postings share a number of properties with historical texts, for example, nonstandard orthography and
grammar, and profuse use of abbreviations. The methods and techniques required for the effective processing
of historical texts are thus also of interest for research in other domains.
|
any_adam_object | 1 |
author | Piotrowski, Michael 1972- |
author_GND | (DE-588)139045368 |
author_facet | Piotrowski, Michael 1972- |
author_role | aut |
author_sort | Piotrowski, Michael 1972- |
author_variant | m p mp |
building | Verbundindex |
bvnumber | BV040705633 |
classification_rvk | ST 306 |
ctrlnum | (OCoLC)828790373 (DE-599)BSZ373304587 |
discipline | Informatik |
format | Book |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>02146nam a2200433 cb4500</leader><controlfield tag="001">BV040705633</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20180219 </controlfield><controlfield tag="007">t</controlfield><controlfield tag="008">130129s2012 ad|| |||| 00||| eng d</controlfield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">1608459462</subfield><subfield code="c">paperback</subfield><subfield code="9">1-60845-946-2</subfield></datafield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9781608459469</subfield><subfield code="c">paperback</subfield><subfield code="9">978-1-60845-946-9</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)828790373</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)BSZ373304587</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-19</subfield><subfield code="a">DE-188</subfield><subfield code="a">DE-12</subfield><subfield code="a">DE-83</subfield><subfield code="a">DE-355</subfield><subfield code="a">DE-739</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 306</subfield><subfield code="0">(DE-625)143654:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">24,1</subfield><subfield code="2">ssgn</subfield></datafield><datafield tag="100" ind1="1" ind2=" "><subfield code="a">Piotrowski, Michael</subfield><subfield code="d">1972-</subfield><subfield code="e">Verfasser</subfield><subfield code="0">(DE-588)139045368</subfield><subfield code="4">aut</subfield></datafield><datafield tag="245" ind1="1" 
ind2="0"><subfield code="a">Natural language processing for historical texts</subfield><subfield code="c">Michael Piotrowski</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">[San Rafael, Calif.]</subfield><subfield code="b">Morgan & Claypool</subfield><subfield code="c">2012</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">XI, 145 S.</subfield><subfield code="b">Ill., graph. Darst.</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="490" ind1="1" ind2=" "><subfield code="a">Synthesis lectures on human language technologies</subfield><subfield code="v">17</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Sprachverarbeitung</subfield><subfield code="0">(DE-588)4116579-2</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Natürliche Sprache</subfield><subfield code="0">(DE-588)4041354-8</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Computerlinguistik</subfield><subfield code="0">(DE-588)4035843-4</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">Sprachverarbeitung</subfield><subfield code="0">(DE-588)4116579-2</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="1"><subfield code="a">Natürliche Sprache</subfield><subfield code="0">(DE-588)4041354-8</subfield><subfield 
code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="2"><subfield code="a">Computerlinguistik</subfield><subfield code="0">(DE-588)4035843-4</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" "><subfield code="5">DE-604</subfield></datafield><datafield tag="776" ind1="0" ind2="8"><subfield code="i">Erscheint auch als</subfield><subfield code="n">Online-Ausgabe</subfield><subfield code="z">978-1-60845-947-6</subfield></datafield><datafield tag="830" ind1=" " ind2="0"><subfield code="a">Synthesis lectures on human language technologies</subfield><subfield code="v">17</subfield><subfield code="w">(DE-604)BV035447238</subfield><subfield code="9">17</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">Digitalisierung BSB Muenchen 21 - ADAM Catalogue Enrichment</subfield><subfield code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=025686081&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">Digitalisierung UB Regensburg - ADAM Catalogue Enrichment</subfield><subfield code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=025686081&sequence=000004&line_number=0002&func_code=DB_RECORDS&service_type=MEDIA</subfield><subfield code="3">Klappentext</subfield></datafield><datafield tag="999" ind1=" " ind2=" "><subfield code="a">oai:aleph.bib-bvb.de:BVB01-025686081</subfield></datafield></record></collection> |
id | DE-604.BV040705633 |
illustrated | Illustrated |
indexdate | 2024-07-10T00:32:07Z |
institution | BVB |
isbn | 1608459462 9781608459469 |
language | English |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-025686081 |
oclc_num | 828790373 |
open_access_boolean | |
owner | DE-19 DE-BY-UBM DE-188 DE-12 DE-83 DE-355 DE-BY-UBR DE-739 |
owner_facet | DE-19 DE-BY-UBM DE-188 DE-12 DE-83 DE-355 DE-BY-UBR DE-739 |
physical | XI, 145 S. Ill., graph. Darst. |
publishDate | 2012 |
publishDateSearch | 2012 |
publishDateSort | 2012 |
publisher | Morgan & Claypool |
record_format | marc |
series | Synthesis lectures on human language technologies |
series2 | Synthesis lectures on human language technologies |
spelling | Piotrowski, Michael 1972- Verfasser (DE-588)139045368 aut Natural language processing for historical texts Michael Piotrowski [San Rafael, Calif.] Morgan & Claypool 2012 XI, 145 S. Ill., graph. Darst. txt rdacontent n rdamedia nc rdacarrier Synthesis lectures on human language technologies 17 Sprachverarbeitung (DE-588)4116579-2 gnd rswk-swf Natürliche Sprache (DE-588)4041354-8 gnd rswk-swf Computerlinguistik (DE-588)4035843-4 gnd rswk-swf Sprachverarbeitung (DE-588)4116579-2 s Natürliche Sprache (DE-588)4041354-8 s Computerlinguistik (DE-588)4035843-4 s DE-604 Erscheint auch als Online-Ausgabe 978-1-60845-947-6 Synthesis lectures on human language technologies 17 (DE-604)BV035447238 17 Digitalisierung BSB Muenchen 21 - ADAM Catalogue Enrichment application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=025686081&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA Inhaltsverzeichnis Digitalisierung UB Regensburg - ADAM Catalogue Enrichment application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=025686081&sequence=000004&line_number=0002&func_code=DB_RECORDS&service_type=MEDIA Klappentext |
spellingShingle | Piotrowski, Michael 1972- Natural language processing for historical texts Synthesis lectures on human language technologies Sprachverarbeitung (DE-588)4116579-2 gnd Natürliche Sprache (DE-588)4041354-8 gnd Computerlinguistik (DE-588)4035843-4 gnd |
subject_GND | (DE-588)4116579-2 (DE-588)4041354-8 (DE-588)4035843-4 |
title | Natural language processing for historical texts |
title_auth | Natural language processing for historical texts |
title_exact_search | Natural language processing for historical texts |
title_full | Natural language processing for historical texts Michael Piotrowski |
title_fullStr | Natural language processing for historical texts Michael Piotrowski |
title_full_unstemmed | Natural language processing for historical texts Michael Piotrowski |
title_short | Natural language processing for historical texts |
title_sort | natural language processing for historical texts |
topic | Sprachverarbeitung (DE-588)4116579-2 gnd Natürliche Sprache (DE-588)4041354-8 gnd Computerlinguistik (DE-588)4035843-4 gnd |
topic_facet | Sprachverarbeitung Natürliche Sprache Computerlinguistik |
url | http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=025686081&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=025686081&sequence=000004&line_number=0002&func_code=DB_RECORDS&service_type=MEDIA |
volume_link | (DE-604)BV035447238 |
work_keys_str_mv | AT piotrowskimichael naturallanguageprocessingforhistoricaltexts |