Wrapper Generation and HTML Reduction
Yu Li
Outline
● The web page extraction problem
● SGWrap System
● Problems with HTML
● HTML reduction
  ○ Basic idea
  ○ Problem definition and goals
  ○ Page model
  ○ Algorithm design
● Future work
The page extraction problem
● A large amount of data exists on the Web in the form of semi-structured HTML pages
● Web data integration requires converting semi-structured data into structured data
● The task of page extraction: convert semi-structured Web data into structured data according to the user's requirements
● A program that performs the page extraction task is usually called a wrapper
The page extraction problem
[Figure: a sample HTML page listing web robots, with Name and Detail columns (for example Platform: java, Purpose: indexing, Availability: source), connected by a mapping to the nodes of a schema tree. The wrapper implements this mapping.]
The page extraction problem
● Page extraction can be carried out by
  ○ Writing the wrapper by hand: the mapping is hard-coded into the wrapper program in a conventional programming language
  ○ Generating the wrapper with a tool: the wrapper program is produced with computer assistance
    ● Extraction rules, interaction style, maintenance
  ○ Fully automatic extraction
    ● Page structure segmentation, annotation
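To make the first option concrete, here is a minimal sketch of a hand-written wrapper in Python. It is not SGWrap code: the sample page, the `RobotWrapper` class, and the field names are all assumptions made for illustration. What it shows is the mapping (first cell is the name, second cell holds `key: value` pairs) being hard-coded into the program.

```python
from html.parser import HTMLParser

# Hypothetical sample page; its layout is an assumption for illustration.
SAMPLE_PAGE = """
<html><body><table>
<tr><td><a href="r1.html">RobotA</a></td>
    <td>Platform: java; Purpose: indexing; Availability: source</td></tr>
<tr><td><a href="r2.html">RobotB</a></td>
    <td>Platform: UNIX; Purpose: maintenance; Availability: none</td></tr>
</table></body></html>
"""


class RobotWrapper(HTMLParser):
    """Hand-written wrapper: the cell-to-field mapping is hard-coded."""

    def __init__(self):
        super().__init__()
        self.records = []      # structured output, one dict per table row
        self._cells = None     # text of the cells in the current row
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._cells.append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            # Hard-coded expectation: exactly two cells per data row.
            if len(self._cells) == 2:
                self.records.append(self._to_record(self._cells))
            self._cells = None

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data

    @staticmethod
    def _to_record(cells):
        # Hard-coded mapping: cell 0 -> name, cell 1 -> "key: value" pairs.
        record = {"name": cells[0].strip()}
        for part in cells[1].split(";"):
            key, _, value = part.partition(":")
            record[key.strip().lower()] = value.strip()
        return record


def extract_robots(html):
    """Run the wrapper over one page and return the structured records."""
    wrapper = RobotWrapper()
    wrapper.feed(html)
    wrapper.close()
    return wrapper.records
```

Calling `extract_robots(SAMPLE_PAGE)` returns one dictionary per table row. Any change in the page layout means editing the code, which is the maintenance cost that tool-assisted wrapper generation tries to reduce.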
SGWrap System
● SGWrap = Schema Guided Wrapper Generation
[Figure: the user interacts with the SGWrap System, which generates a Wrapper Program; the Wrapper Program runs on an HTML page and outputs data.]
SGWrap System
● SGWrap consists of three main parts:
  ○ SGWrap Runtime (Runtime, for short) provides services for accessing our web page content extraction algorithms. It is the underlying functional layer of the whole system; to reuse or integrate a wrapper, you also need to reuse or integrate the Runtime itself.
  ○ SGWrap Compiler (Compiler, for short) compiles SGWrap rules into a wrapper in both source code and bytecode form. It works like a translator: the generated source code is human-readable and can be modified to meet special needs. The bytecode is produced by compiling that source with Java's compiler, javac.
  ○ Visual SGWrap is a visual tool for generating rules. The user interacts with it through simple select-and-click operations, and it then computes the appropriate rules.
SGWrap System – basic usage
[Screenshot: Visual SGWrap with the sample page D:\Robots.htm loaded. The Schema and Rule panels offer the actions Open DTD, Add Mapping, Remove Mapping, Generate Rule and Save Rules; a schema node is mapped to the DataPath /HTML/BODY/TABLE/TBODY/TR[1]/TD[0]/A.]
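The DataPath in the Visual SGWrap screenshot, /HTML/BODY/TABLE/TBODY/TR[1]/TD[0]/A, reads like a path from the document root to one element. The slides do not define how SGWrap evaluates such a path, so the Python sketch below only illustrates one plausible reading. It assumes that each index is zero-based and counts siblings with the same tag, and that the input HTML is well formed. `Node`, `TreeBuilder`, `resolve` and `extract` are hypothetical names, not part of SGWrap.

```python
from html.parser import HTMLParser


class Node:
    """One element of a very small DOM-like tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds the tree; assumes every opened tag is explicitly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag.upper(), self._current)
        self._current.children.append(node)
        self._current = node

    def handle_endtag(self, tag):
        # Only step back up when the closing tag matches the open element.
        if self._current.tag == tag.upper() and self._current.parent is not None:
            self._current = self._current.parent

    def handle_data(self, data):
        self._current.text += data


def parse_step(step):
    """Split 'TR[1]' into ('TR', 1); a step without an index means index 0."""
    if step.endswith("]"):
        name, _, index = step[:-1].partition("[")
        return name, int(index)
    return step, 0


def resolve(root, path):
    """Follow the path from the root; return None if any step has no match."""
    node = root
    for step in path.strip("/").split("/"):
        tag, index = parse_step(step)
        # Assumption: the index is zero-based among same-tag siblings.
        matches = [child for child in node.children if child.tag == tag]
        if index >= len(matches):
            return None
        node = matches[index]
    return node


def extract(html, path):
    """Return the stripped text of the element at path, or None."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    node = resolve(builder.root, path)
    return node.text.strip() if node is not None else None


# Hypothetical page with an explicit TBODY, since html.parser does not add one.
SAMPLE = (
    "<html><body><table><tbody>"
    "<tr><td><a>RobotA</a></td><td>java</td></tr>"
    "<tr><td><a>RobotB</a></td><td>UNIX</td></tr>"
    "</tbody></table></body></html>"
)
```

Under the zero-based assumption, `extract(SAMPLE, "/HTML/BODY/TABLE/TBODY/TR[1]/TD[0]/A")` selects the link in the first cell of the second row. A path that points past the available rows yields `None` instead of raising an error.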
SGWrap System – basic usage
● 3 steps
  ○ Design the rule using Visual SGWrap
  ○ Compile the rule into a program using SGWrapC
  ○ Test and apply the wrapper using SGWrap (Runtime)
● There is a tutorial at http://idke.ruc.edu.cn/sgwrap/doc/A-10-Minutes-Tutorial.html (also included in the documentation of each installation)
Welcome to http://idke.ruc.edu.cn/sgwrap
[Screenshot: the SGWrap (Schema Guided Wrapper Generation) System homepage.]
What is SGWrap System: Schema Guided Wrapper Generation System (SGWrap, for short) is a toolkit for web page information extraction. It can semi-automatically generate programs called wrappers, built from extraction rules through user interaction. A wrapper for a set of web pages is a program used to extract content from the pages and output structured data for further processing. A wrapper for a given set of pages, materialized as a Java program by the SGWrap system, can be easily generated using the system's Visual SGWrap tool and can be reused or integrated in many information systems.