Bachelor + Master Publishing
811 bachelor's theses, 533 master's theses, 10,103 Diplom theses

Semi-automatic ontology engineering and ontology supported document indexing in a multilingual environment

About this book
  • Type: Diplom thesis (Diplomarbeit)
  • Author: Boris Lauser
  • Submission date: January 2003
  • Length: 130 pages
  • File size: 1.8 MB
  • Grade: 1.3
  • Institution / University: Universität Fridericiana Karlsruhe (TH), Germany
  • ISBN (eBook): 978-3-8324-6905-4
  • ISBN (Paperback): 978-3-8324-6905-4 P
  • ISBN (CD): 978-3-8324-6905-4 CD
  • Language: English
  • Award:
  • Cite this work: Lauser, Boris, January 2003: Semi-automatic ontology engineering and ontology supported document indexing in a multilingual environment, Hamburg: Diplomica Verlag
  • Keywords: classification, pruning, multi-label classification, multilingual, thesaurus

Diplom thesis by Boris Lauser

Introduction:

The management of large amounts of information and knowledge is of ever increasing importance in today’s large organisations. With the ongoing ease of supplying information online, especially in corporate intranets and knowledge bases, finding the right information becomes an increasingly difficult task. Today’s search tools perform rather poorly in the sense that information access is mostly based on keyword searching or even mere browsing of topic areas. This unfocused approach often leads to undesired results. The following example illustrates the problem more clearly: an agriculture scientist would like to find out which organisation established the Agreement on Agriculture. A simple search for “establish Agreement on Agriculture” might return a huge list of documents containing these words, yet none of them containing the desired answer: WTO or World Trade Organisation. The problem becomes even worse if the answer only appears in a foreign-language document.

Semantically annotated documents, i.e. documents that are indexed with ontological terms and concepts instead of simple keywords, provide several advantages. First, the ontological abstraction provides robustness against changes in the document. In the above example, the document representation might change using the term ‘Agricultural Agreement’ instead of ‘Agreement on Agriculture’. However, since the document has been annotated with the ontological semantics, this will not affect the search results. Second, since the ontology used for annotating the document in this example is domain-specific, the semantic meanings and interpretations of keywords are bound to that domain and therefore the retrieval is likely to be more efficient. A term can have several meanings in different domains. By first mapping the keyword to its semantic representation in a specific ontology and using the ontology’s linked knowledge structure, a much more focused search approach can be taken. Third, document specific representations no longer affect the search. This is extremely important in the case of multilingual representations. Keywords of several languages are mapped to the same concept in an ontology and are therefore given the same meaning. Multilingual search portals can be established to produce the same results, no matter which language is used for retrieval.
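The mapping of keywords from several languages onto one concept can be illustrated with a minimal Python sketch. It is not taken from the thesis: the Concept class, the example.org URI, the document identifier and the helper functions are all hypothetical and only show the principle that retrieval goes through the concept rather than through the surface keyword.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One ontology concept with its labels grouped by language code."""
    uri: str
    labels: dict = field(default_factory=dict)  # language code -> set of labels


def build_label_index(concepts):
    """Map every lower-cased label, in any language, to its concept URI."""
    index = {}
    for concept in concepts:
        for terms in concept.labels.values():
            for term in terms:
                index[term.lower()] = concept.uri
    return index


# Hypothetical concept and URI, for illustration only.
agreement = Concept(
    uri="http://example.org/onto#AgreementOnAgriculture",
    labels={
        "en": {"Agreement on Agriculture", "Agricultural Agreement"},
        "fr": {"Accord sur l'agriculture"},
    },
)
index = build_label_index([agreement])

# Documents are annotated with concept URIs instead of plain keywords.
annotations = {"doc-1": {agreement.uri}}


def search(keyword):
    """Return the documents annotated with the concept that the keyword names."""
    uri = index.get(keyword.lower())
    return sorted(doc for doc, uris in annotations.items() if uri in uris)
```

In this sketch a query with the reworded English term or with the French label resolves to the same concept and therefore returns the same document, while an unknown keyword returns nothing.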

An important task in knowledge management that facilitates the search scenario described above is the classification and indexing of documents. At present, subject specialists are responsible for this time-consuming process. However, with today’s vast amount of available information on the WWW, automatic support is needed to manage this task efficiently. Ontologies play a critical role in supporting the machine-readable semantics needed to facilitate automation.

They can be used for providing the categories and keywords needed to describe the content of documents. Automatic text classification tools still lack the necessary precision to replace human indexers and need to be extensively evaluated in different domains. Before such powerful Semantic Web applications can be built and used within certain domains of knowledge, the basic requirement - a machine-readable vocabulary represented by a domain ontology - has to be established. The creation of ontologies is a time-consuming task and is often carried out in an ad-hoc manner. Only a few methodologies exist, and the existing ones are often extremely complex and need extensive training and expertise. Even less automated tool support is available. Constituting the knowledge base for future Semantic Web applications, domain ontologies have to be created continuously in all possible areas and communities. The need for a reusable methodology is evident.

Table of Contents:

1. INTRODUCTION
1.1 MOTIVATION
1.2 APPROACH
1.3 OUTLINE
2. THE PROJECT ENVIRONMENT
2.1 FAO AND THE AOS
2.2 INFORMATION MANAGEMENT AT THE FAO
2.2.1 Resources and metadata
2.2.2 The information management system
2.2.3 AGROVOC Thesaurus and Document Indexing
2.3 PROBLEMS WITH THE CURRENT SYSTEM AND PROPOSAL
3. SEMANTIC WEB
3.1 THE IDEA
3.2 ONTOLOGIES
3.2.1 Introduction
3.2.2 Types of ontologies
3.2.3 Ontology representation languages
3.2.4 KAON
3.2.5 Ontology Engineering
4. INTRODUCTION OF ONTOLOGY BASED INFORMATION MANAGEMENT SYSTEM AT THE FAO
4.1 THE PROTOTYPE PROJECT
4.2 REQUIREMENTS REGARDING THE AOS
4.3 ONTOLOGY ENGINEERING FRAMEWORK
4.3.1 Overview
4.3.2 Initialisation of the cycle
4.3.3 The 5 phases of the framework
4.4 THE ONTOLOGY BROWSER
4.5 REPRESENTATION OF AGROVOC IN KAON
4.6 RELATED WORK AND POSITIONING
4.7 CURRENT STATUS AND FURTHER WORK
5. THE ONTOLOGY PRUNER
5.1 INTRODUCTION TO THE PRUNING APPROACH
5.2 ADAPTATION OF THE ONTOLOGY PRUNER
5.3 EVALUATION
5.3.1 Resources: Document corpus and source ontology
5.3.2 Hypotheses for evaluation
5.3.3 Evaluation plan
5.4 RESULTS AND DISCUSSION
5.4.1 Pruner Trie vs. Pruner
5.4.2 Dependency of the statistics on different parameter settings
5.4.3 Generic Document Set 1 (Gen) vs. Generic Document Set 2 (AG)
5.4.4 Empirical evaluation
5.5 SUMMARY
6. AUTOMATIC CLASSIFICATION
6.1 INTRODUCTION
6.1.1 What is text categorisation?
6.1.2 Motivation within the project context
6.2 BASIC DEFINITIONS
6.2.1 Using Support Vector Machines for Multi-label Document Indexing
6.2.2 Evaluation measures
6.3 ADAPTATION OF THE CLASSIFIER
6.3.1 Multi-label vs. single-label Indexing
6.3.2 Multiple Languages
6.3.3 Integration of background knowledge
6.3.4 Multi-class problem and class hierarchy
6.4 SET OF TRAINING AND TEST DOCUMENTS
6.5 EVALUATION
6.5.1 Single-label vs. multi-label classification
6.5.2 Multilingual classification
6.5.3 Integration of domain specific background knowledge
6.6 RELATED WORK
6.7 SUMMARY AND OUTLOOK
7. CONCLUSION
7.1 SUMMARY
7.2 OUTLOOK
REFERENCES
A KAON RDFS REPRESENTATION OF THE ONTOLOGY ON FOOD SAFETY, ANIMAL AND PLANT HEALTH (EXTRACT)
B COMPLETE LIST OF WEB SITES OUTPUT BY THE FOCUSED CRAWLER
C AGROVOC CATEGORIES
D RESULTS OF ONTOLOGY INTEGRATION INTO AUTOMATIC TEXT CLASSIFICATION

Automatically generated text excerpt:

Converting a thesaurus or another existing vocabulary into an ontology is still more of an art than a well-defined process and has to be considered carefully in each specific case. One approach, similar to the one chosen here, maps an Art and Architecture Thesaurus to an ontology using Protégé and is described in Wielinga et al. [WSWS01]. Another interesting methodology for integrating ontologies and thesauri to build RDFS schemas is given in [AF99]. The methodology discussed there is platform independent and describes a multi-step process for mapping thesaurus terms to an ontology. In that approach, an ontology structure of super-concepts is created first, followed by the mapping of the thesaurus terms into this high-level structure of top concepts. Here a more straightforward approach has been chosen: terms are mapped directly to concepts, and the thesaurus hierarchy is left as is in the first instance. Resolving this hierarchy and restructuring it with super-concepts is left for further assessment, as explained before in the engineering framework. The first problem in converting a thesaurus arises with the question of what a concept is. As described in chapter 2, AGROVOC is a collection of terms, each term being either a descriptor or a non-descriptor. The reason for this distinction is that a thesaurus is intended solely for indexing. An ontology, however, is more generic and can be used in different ways, and therefore does not reflect this very specific differentiation. One possible solution would be to simply have a concept for each keyword and link the former non-descriptors with the same relations (use and used for relation) to the remainder of the mapped concepts. The disadvantage of this approach is, however, that in the core ontology these relationships are unlikely to exist and to be used, since the KAON ontology model provides different modelling constructs for reflecting such relationships. Basically, a descriptor and a non-descriptor together refer to the same concept, which follows from the indexing rule to use a descriptor instead of a non-descriptor for indexing. The synonym concept of the KAON Lexical OIModel reflects such a relationship. Experience has shown that a non-descriptor can in most cases be seen as a synonym of its descriptor. The option chosen here is therefore to map each descriptor term to a concept with a label in each language provided in the AGROVOC. Each non-descriptor has been mapped to a synonym attached to the concept of its descriptor. For each translation available in the AGROVOC, a lexical [...]
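The chosen mapping (one concept per descriptor, with a label per language, and non-descriptors attached as synonyms) can be sketched in plain Python. This is only an illustration under stated assumptions: it uses ordinary dictionaries rather than the KAON API, and the record layout, the identifier "c1" and the sample terms are hypothetical rather than actual AGROVOC entries.

```python
from collections import defaultdict


def convert(descriptors, non_descriptors):
    """Turn thesaurus terms into concepts with labels and synonyms.

    descriptors: {term_id: {language: label}}
    non_descriptors: iterable of (label, language, used_descriptor_id)
    """
    # Every descriptor becomes one concept carrying a label per language.
    concepts = {
        term_id: {"labels": dict(labels), "synonyms": defaultdict(list)}
        for term_id, labels in descriptors.items()
    }
    for label, language, descriptor_id in non_descriptors:
        # A non-descriptor becomes a synonym of its descriptor's concept
        # instead of a concept of its own.
        concepts[descriptor_id]["synonyms"][language].append(label)
    return concepts


# Hypothetical sample data, for illustration only.
descriptors = {"c1": {"en": "maize", "fr": "maïs"}}
non_descriptors = [("corn", "en", "c1")]
concepts = convert(descriptors, non_descriptors)
```

The "use" and "used for" relations disappear as explicit links in this representation, because the synonym list of a concept already records which non-descriptors point to it.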

[...] API either from a database or an RDFS file. For more details on the overall architecture refer to the developer guide on the KAON web site. When called with the information above, a new browser session is invoked. The user is then able to browse and search the whole ontology structure in all available languages. At the current stage of the project the parameter handling is not implemented, since it is subject to application-specific interfacing and therefore not part of the framework. The KAON Portal has been extended, however, with the functionality to mark terms while browsing the ontology. The user can browse the ontology in all five FAO languages and compile a query string consisting of an arbitrary number of marked entities of the ontology. The lexicalizations of the entities are made visible to the user. The browsing session stores the entity URIs together with their lexicalizations (the labels in the currently active language) to be used by and backed up into the calling application (which would be the CDS system in this case). Figure 15 shows a screenshot of the adjusted KAON Portal for the OFsAPH. [...]
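The marking behaviour described above can be pictured with a small Python sketch. It does not reproduce the KAON Portal code: the BrowsingSession class, its methods and the example.org URIs are hypothetical, and falling back to the URI when a label is missing is an assumption made here.

```python
from dataclasses import dataclass, field


@dataclass
class BrowsingSession:
    """Keeps the entities a user has marked while browsing the ontology."""
    language: str = "en"
    marked: dict = field(default_factory=dict)  # entity URI -> visible label

    def mark(self, uri, labels):
        """Mark an entity and store its label in the active language."""
        # Assumption: fall back to the URI if no label exists in this language.
        self.marked[uri] = labels.get(self.language, uri)

    def query_string(self):
        """Join the labels of all marked entities in marking order."""
        return " ".join(self.marked.values())


# Hypothetical entities, for illustration only.
session = BrowsingSession(language="fr")
session.mark("http://example.org/onto#Maize", {"en": "maize", "fr": "maïs"})
session.mark("http://example.org/onto#Rice", {"en": "rice", "fr": "riz"})
```

Because the session keeps the URIs next to the labels, a calling application could store the language-independent identifiers and still show the user readable terms.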

[...] format. In this project, the AGROVOC thesaurus has been chosen as the input resource. Since the representation of the AGROVOC as a KAON ontology is not only used for this step but also serves as a resource for the automatic text classification algorithm discussed later in chapter 6, the conversion will be discussed in more detail in a separate section later in this chapter. The aim of this acquisition step is to automatically extract as much as possible of the subset of this ontology structure that is relevant for the target domain. An ontology pruner, developed in [Volz00] and previously applied in [KVM00], has been extended and adapted to accomplish this task. Because the approach is a heuristic, a detailed and extensive evaluation of the algorithm has been carried out. The algorithm, its application and its evaluation will be discussed in detail in chapter 5. The input to the pruner is the converted vocabulary to be reused together with two sets of documents, a domain-specific one and a generic one, both explained in detail in chapter 5. The output of this step is a subset of the initial ontology, i.e. the input ontology pruned to contain only the concepts relevant for this domain. The result has to be assessed and validated by subject experts and combined with the core ontology. This is done in the next phase of the engineering cycle. In the prototype project the not yet adapted version of the pruner has been applied, again due to time restrictions and the need for some fast results. The set of domain-specific documents used for Text-To-Onto in the previous step has been reused here. An ontology with 504 concepts has been extracted from the AGROVOC in this application. Since an extensive evaluation of the algorithm follows in chapter 5, I will not go into more detail with this prototype result. The evaluation presented in chapter 5 is actually the main part of the second application of the engineering cycle. [...]
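The excerpt names the inputs and output of the pruner but not its heuristic, so the following Python sketch is an assumption and not the algorithm of [Volz00]. It keeps a concept when its label is relatively more frequent in the domain-specific documents than in the generic ones, and it also keeps the ancestors of kept concepts. All function names, the ratio parameter and the sample data are hypothetical, and the substring counting is deliberately naive.

```python
from collections import Counter


def relative_frequencies(documents, labels):
    """Share of all label occurrences that each concept label accounts for."""
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        for concept, label in labels.items():
            # Naive substring counting; a real system would tokenise first.
            counts[concept] += lowered.count(label.lower())
    total = sum(counts.values()) or 1
    return {concept: counts[concept] / total for concept in labels}


def prune(labels, parents, domain_docs, generic_docs, ratio=1.0):
    """Return the concepts that are relatively more frequent in the domain set.

    Ancestors of kept concepts are kept too, so the hierarchy stays connected.
    """
    domain = relative_frequencies(domain_docs, labels)
    generic = relative_frequencies(generic_docs, labels)
    kept = {c for c in labels if domain[c] > ratio * generic[c]}
    for concept in list(kept):
        parent = parents.get(concept)
        while parent is not None and parent not in kept:
            kept.add(parent)
            parent = parents.get(parent)
    return kept


# Hypothetical toy ontology and corpora, for illustration only.
labels = {"food": "food", "food_safety": "food safety", "forestry": "forestry"}
parents = {"food_safety": "food", "food": None, "forestry": None}
domain_docs = ["Food safety standards protect consumers.", "Food safety inspection rules."]
generic_docs = ["Forestry and food production.", "Forestry policy and food."]
pruned = prune(labels, parents, domain_docs, generic_docs)
```

Whatever heuristic is used, its output is only a candidate subset; as the text states, the result still has to be assessed and validated by subject experts.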

