Adaptive Semi-structured Information Extraction

Information extraction tools need to become usable in more domains and for more tasks. One way to reach this goal is to construct user-driven information extraction systems that novice users can adapt to new domains and tasks. To accomplish this, the systems need to become more intelligent and learn to extract information without requiring expert skills or time-consuming work from the user.

The type of information extraction system in focus for this thesis is semi-structural information extraction. The term semi-structural refers to documents that contain not only natural language text but also additional structural information. The typical application is information extraction from World Wide Web hypertext documents. By making effective use of both the link structure and the structural information within each such document, user-driven extraction systems with high performance can be built.

The extraction process contains several steps in which different types of techniques are used. Examples are techniques that take advantage of structural, purely syntactic, linguistic, and semantic information. The first step in focus for this thesis is the navigation step, which takes advantage of the structural information…

Contents

1 Introduction
1.1 Thesis Goals
1.2 Thesis Overview
2 Background
2.1 Knowledge Management
2.2 Information Retrieval
2.3 Information Extraction
2.4 Semantic Web
2.5 Semi-structured IE
2.6 Agent-Oriented Development
2.6.1 The Agent Concept
2.6.2 Agent-Oriented Programming
2.6.3 Agent Properties
2.6.4 Agent Architectures
2.6.5 Agent Design
2.6.6 Agent Communication
2.6.7 Agent Frameworks
2.7 The Buyer’s Guide System
2.7.1 System Architecture
2.7.2 Extraction Approach
2.7.3 Extraction Problems
2.7.4 Conclusions
3 The Extraction Task
3.1 Extraction Task Types
3.2 System Modes
3.2.1 The Training Mode
3.2.2 The Extraction Mode
3.2.3 The Query Mode
3.3 The Hypertext Model
3.4 The Navigation Step
3.4.1 Learning to Navigate
3.5 Additional Steps
4 Design for the ASIE System
4.1 Methodology
4.2 Agent Platform
4.3 System Architecture
4.3.1 The Butler Agent
4.3.2 The Surfer Agent
4.3.3 The Analyzer Agent
4.4 The Learning Algorithm
4.4.1 Random Walk Experiment
4.4.2 Learning Algorithm Extensions
4.4.3 Algorithm Complexity
5 Evaluation
5.1 Experiment Overview
5.2 Experiment Setup
5.3 Results
5.3.1 Local Optima Problem
5.3.2 Non-greedy Action Problem
5.3.3 Penalty Accumulation Problem
6 Related Work
6.1 Existing Semi-structural IE Systems
6.1.1 Ashish and Knoblock’s Wrapper Generation Toolkit
6.1.2 Rapper: A Wrapper Generator with Linguistic Knowledge
6.1.3 Wrappers in the TSIMMIS System
6.1.4 The Webfoot Preprocessor
6.1.5 The ShopBot Comparison Shopping Agent
6.1.6 The WYSIWYG Wrapper Factory (W4F)
6.1.7 Head-Left-Right-Tail (HLRT) and Related Wrappers
6.2 Discussion
7 Ethical Considerations
7.1 General Effects of Knowledge Management
7.2 Intellectual Property Rights
7.3 Ethical Theories
8 Conclusions
8.1 Intelligent Navigation
8.2 Information Extraction
8.3 Ethics
8.4 Limitations
8.5 Future Work
Bibliography

Author: Arpteg, Anders

Source: Linköping University
