Bachelor + Master Publishing

Strings of Natural Languages

Unsupervised Analysis and Segmentation on the Expression Level

About this book
  • Type: Diplom thesis
  • Author: Stengel
  • Submission date: August 2006
  • Length: 146 pages
  • File size: 1.9 MB
  • Grade: 1.0
  • Institution / University: Eberhard Karls Universität Tübingen, Germany
  • Original title: Unsupervised Analysis and Segmentation of Strings of Natural Languages on the Expression Level
  • Bibliography: approx. 68
  • ISBN (eBook): 978-3-8366-0627-1
  • Language: English
  • Award:
  • Cite this work: Stengel, August 2006: Strings of Natural Languages, Hamburg: Diplomica Verlag
  • Keywords: Automatic syntax analysis, automatic corpus creation, computational linguistics, corpus linguistics, meta-rating

Diplom thesis by Stengel

Abstract:

Learning a second language is often difficult. One major reason for this is the way we learn: we try to translate the words and concepts of the other language into those of our own. As long as the languages are fairly similar, this works quite well. However, when the languages differ greatly, problems are bound to appear. For example, to someone whose first language is French, English is not difficult to learn. In fact, he can pick up any English book and at the very least recognize words and sentences. But if he is tasked with reading a Japanese text, he will be completely lost: there are no familiar letters and no whitespace, and only occasionally does a glyph appear that resembles a punctuation mark.

Nevertheless, anyone can learn any language. Correct pronunciation and understanding alien utterances may be hard for the individual, but as soon as the words are transcribed to some kind of script, they can be studied and - given some time - understood. The script thus offers itself as a reliable medium of communication.

Sometimes the script can be very complex, though. For instance, the Japanese language is not much more difficult than German, but the Japanese script is. If someone untrained in the language is given a Japanese book and told to create a list of its vocabulary, he will likely have to give up on the task.

Or will he? Are there perhaps ways to analyze the text despite his unfamiliarity with this type of script and language? Should there not be characteristics shared by all languages that can be exploited?

This thesis assumes the point of view of such a person, and shows how to segment a corpus in an unfamiliar language while employing as little previous knowledge as possible.

To this end, a methodology for the analysis of unknown languages is developed. The only requirement is that a large corpus in electronic form is available, one that has undergone only minimal preprocessing. Analysis is limited strictly to the expression level; semantics are purposefully left out of consideration. This clearly distinguishes this work from other works, limits comparability to some extent, and may make the detection of some kinds of language features hard or even impossible.

Only unsupervised analysis is admissible, and no specific information on grammatical rules, ways to segment the text, what separators look like, etc. is employed. Furthermore, no parameters such as absolute thresholds or selection of the n-best candidates are allowed; all parameters and evaluation must be relative and justifiable, not based on experimental results. Though this makes the task of this thesis harder, it also has the advantage that no parameters need to be adjusted or optimized to fit a particular corpus or language.

Chapter one gives an overview of the languages examined in this work: English, German, Hebrew and Japanese. It also argues their choice, suitability and representativeness.

Chapter two introduces categorization, a key concept in this thesis. Categorization is used for segmentation, classification and other tasks. Furthermore, some sample categorizations exemplify application of this concept.

Chapter three covers the technical basis of this work. Methods and techniques from various fields are introduced, namely data compression, bioinformatics, statistics and cryptology. The methods developed in this work chiefly employ the algorithms and concepts introduced in this chapter.
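To give a flavor of the kind of expression-level measures chapter three draws on, the following is a minimal sketch of three textbook techniques named in the table of contents: character n-gram counts, the index of coincidence, and LZ78 phrase parsing. It is an illustration only, not the thesis' implementation; the function names (ngram_counts, index_of_coincidence, lz78_phrases) and the sample strings are assumptions made for this example.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count all overlapping character n-grams of length n (hypothetical helper)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def index_of_coincidence(text):
    """Probability that two characters drawn without replacement are equal.

    Standard definition: sum(c * (c - 1)) / (N * (N - 1)) over character counts c.
    """
    total = len(text)
    if total < 2:
        return 0.0
    counts = Counter(text)
    return sum(c * (c - 1) for c in counts.values()) / (total * (total - 1))


def lz78_phrases(text):
    """Split text into LZ78 phrases: each phrase is a known phrase plus one new character."""
    known = {""}
    phrases = []
    current = ""
    for ch in text:
        if current + ch in known:
            current += ch          # keep extending while the phrase is already known
        else:
            known.add(current + ch)
            phrases.append(current + ch)
            current = ""
    if current:                    # leftover that matched a known phrase at end of input
        phrases.append(current)
    return phrases


if __name__ == "__main__":
    sample = "abababab"            # toy input, not taken from any corpus in the thesis
    print(ngram_counts(sample, 2))
    print(index_of_coincidence(sample))
    print(lz78_phrases(sample))
```

None of these functions needs to know what a word, a letter class or a separator is; they operate on raw character strings only, which is in line with the restriction to the expression level described above.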

Chapter four states the tasks tackled in this work and reports results and devised methods. It starts with the experimental setup, and continues with an introduction to the evaluation and rating methodology of this thesis. Then two ways to automatically create excerpts from a corpus follow. The detection of syntactic separators and segmentation of text conclude the chapter.

Finally, chapter five summarizes this work’s achievements, and chapter six gives an outlook on possible and promising future challenges.

Table of Contents:

List of Figures
List of Tables
List of Algorithms
List of Abbreviations
Introduction
1. Language
1.1 Definitions
1.2 Languages
1.2.1 English
1.2.2 German
1.2.3 Hebrew
1.2.4 Japanese
2. Categorization
2.1 Definitions
2.2 Sample Application
2.3 Conclusion
3. Analysis Methods and Techniques
3.1 Level of Abstraction
3.2 Data Compression
3.2.1 Overview
3.2.2 Information content and its quantification
3.2.3 Kinds of data compression
3.2.4 Run length encoding
3.2.5 Dictionary-based data compression: LZ78, LZW, LZMW
3.2.6 LZMW78
3.2.7 Sample application
3.3 Longest Common Subsequence
3.3.1 Overview
3.3.2 Application
3.4 Statistics: N-Gram and Term Frequency
3.4.1 Definitions
3.4.2 Limited applicability of published statistics
3.4.3 The challenges of collecting statistics
3.4.4 Fixed term size
3.4.5 Variable term size
3.4.6 Suffix tree
3.4.7 Suffix array
3.5 Cryptology
3.5.1 Motivation
3.5.2 Character frequency
3.5.3 Index of coincidence
3.5.4 Patterns
4. Tasks and Results
4.1 Experimental Setup
4.1.1 Corpora
4.1.2 Preprocessing
4.1.3 System and implementation
4.1.4 Selection of results
4.2 Evaluation and Meta-Rating
4.3 Excerpting a Corpus
4.3.1 First type: entropy analysis
4.3.2 Second type: index of coincidence
4.4 Detecting Syntactic Separators
4.4.1 Character order
4.4.2 Repetitions
4.4.3 Pangrams
4.4.4 Compression and LCS
4.4.5 Aligner
4.5 Text Segmentation
4.5.1 Suffix and prefix detection
4.5.2 Compound splitter
4.5.3 Palindromes
4.5.4 Frequency statistics
5. Conclusion
6. Outlook
Glossary
Appendix A Extra Data
Appendix B Experimental Results
Bibliography
Index

On request, we will gladly send you a text sample by e-mail. Please quote order number 10627 and send your e-mail to info(a)diplom.de.
