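An Extraction System with On-Line Learning and Its Automatic Maintenance Mechanism

Abstract

The rapid growth of information on the Internet has led users to rely increasingly on extraction programs (wrappers) to extract data from websites. A wrapper extracts information from Web pages and stores it in a user-defined format so that the processed data can be put to further use. This paper proposes two new methods. The first is signal-based: it finds correlation features between user-labeled examples and a Web page. We call it the "histogram and tag name distribution-based correlation coefficient". The second treats every tag in a Web page as a numerical weight and computes the positions of the area barycenters; these positions reveal how the data items in the page are distributed. We call it the "area barycenter method". In addition, the system incorporates a mechanism, based on the adaptive resonance theory (ART) algorithm, that learns and corrects extraction rules on its own so that existing rules keep adapting to changes in new Web pages. The information contained in different websites is further integrated using the concept of ontology, and this paper also proposes a neural network-based method for computing semantic similarity between words. On the other hand, because information on the Internet changes rapidly and keeps growing, existing website wrappers may stop working and must therefore be maintained and updated frequently, or even rewritten entirely. In this paper, we propose an automatic maintenance mechanism based on a digital filter, which regenerates a correct wrapper and sends a notification to the system developer.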


A Novel On-Line Learning Wrapper System and Its Automatic Maintenance Mechanism

Abstract

The amount of information available on the World Wide Web has increased dramatically in recent years; however, many information resources are formatted for human browsing rather than for software programs. Developing a tool that automatically extracts information from semi-structured Web sources, and thereby increases the utility of the Web for value-added services, is a demanding task. Such a tool is usually called a wrapper. In this paper, we develop two signal-based methods for implementing a wrapper. The first is called the "histogram and tag name-based correlation coefficient". It discovers correlation features between the template marked by the user and a Web page, and uses them to drive the extraction system. In this method, templates for records with different tag structures are generated incrementally by an ART-like algorithm that follows the basic idea of the ART1 algorithm. Records in a Web page can then be detected efficiently by matching against the generated templates. The second method treats every tag in a Web page as a weight and computes the area barycenters of the page. Once all area barycenters have been recorded, their distribution helps us recognize the data we want. We then propose an ontology-based method that integrates the information extracted from separate wrapped Web sources by evaluating the similarities between their attributes. We also propose a neural network-based approach for measuring semantic similarity between words.
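
To make the two signal-based ideas concrete, the following Python sketch shows one possible reading of them. It is an illustration under stated assumptions, not the implementation described in this paper: the abstract does not define the coefficient, so the sketch assumes a Pearson correlation between tag-name histograms, and the function names (`tag_histogram`, `correlation`, `area_barycenter`) and the tag weights are hypothetical.

```python
import re
from collections import Counter
from math import sqrt

# Matches opening tags only; "</td>" is skipped because "/" follows "<".
TAG_RE = re.compile(r"<\s*([a-zA-Z][a-zA-Z0-9]*)")


def tag_histogram(html):
    """Count opening tags by lowercase tag name."""
    return Counter(name.lower() for name in TAG_RE.findall(html))


def correlation(hist_a, hist_b):
    """Pearson correlation of two tag histograms (assumed similarity measure).

    The histograms are compared over the union of their tag names.
    Returns 0.0 when either histogram has no variance.
    """
    names = sorted(set(hist_a) | set(hist_b))
    if not names:
        return 0.0
    xs = [hist_a.get(n, 0) for n in names]
    ys = [hist_b.get(n, 0) for n in names]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def area_barycenter(weights):
    """Weighted mean position of a window of per-tag weights.

    Each tag in the window contributes its (hypothetical) weight at its
    position index. Returns None when the total weight is zero.
    """
    total = sum(weights)
    if total == 0:
        return None
    return sum(i * w for i, w in enumerate(weights)) / total
```

In this reading, a candidate region whose tag histogram correlates strongly with the user-marked template is a likely record, and the barycenter of each window summarizes where the tag weight in that window is concentrated.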

Because the WWW is extremely dynamic and continually evolving, the structures of Web documents change frequently, and existing wrappers may stop working as they did before. In this paper, we propose a filtering approach to implementing an automatic wrapper maintenance mechanism. The basic idea is to use a band-pass filter to automatically locate the contents of interest and then regenerate record templates in order to construct a new, correct wrapper.
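
The filtering idea can be sketched as follows. This is a minimal illustration, not the filter design used in this paper: it assumes the page has already been converted into a one-dimensional numeric signal (for example, one value per tag), and it uses the difference of two moving averages as a crude band-pass filter. The function names, window widths, and threshold are all hypothetical.

```python
def moving_average(signal, width):
    """Centered moving average; windows are truncated at the edges."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - half):min(len(signal), i + half + 1)]
        out.append(sum(window) / len(window))
    return out


def band_pass(signal, short_width=3, long_width=15):
    """Crude band-pass: short moving average minus long moving average.

    The short average removes fast fluctuations, and subtracting the long
    average removes the slowly varying baseline, so mid-scale structure
    remains.
    """
    fast = moving_average(signal, short_width)
    slow = moving_average(signal, long_width)
    return [f - s for f, s in zip(fast, slow)]


def locate_region(response, threshold):
    """Return (first, last) indices where the response exceeds threshold,
    or None if it never does."""
    hits = [i for i, value in enumerate(response) if value > threshold]
    if not hits:
        return None
    return hits[0], hits[-1]
```

Under these assumptions, a block of records that stands out from the surrounding page produces a sustained positive response, and the located index range marks where new record templates could be regenerated.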