The Hidden Web Data Extraction Algorithm Based on Numerical Attributes

Scientific Journal of Information Engineering February 2016, Volume 6, Issue 1, PP.1-8

SUN Yang#, LI Gui, HAN Zi-yang, LI Zheng-yu, SUN Ping
Faculty of Information & Control Engineering, Shenyang Jianzhu University, Shenyang 110168, China
#Email: 626714435@qq.com

Abstract

When a user retrieves data from a back-end database through a web query interface, the number of tuples returned per query is limited, so only part of the hidden database can be obtained. Existing search engine technology also struggles to crawl all the data in a hidden database effectively. To address this, a rank-shrink (sort-based partitioning) algorithm is proposed for the numerical attributes of the back-end hidden database. With this algorithm, all data tuples of the hidden database can be acquired with fewer queries. A theoretical analysis of the algorithm's query cost is given, and the effectiveness of the algorithm is verified by experiments.

Keywords: Hidden Database; Numerical Attribute; Binary-shrink; Rank-shrink


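Introduction

As is well known, existing search engine technology follows hyperlinks and crawls only part of the data on the surface pages of the Internet. Today, more and more organizations allow public users to access their back-end databases through web query interfaces. A user specifies conditions through the web query interface and submits the query to the system; the query is run against the back-end database, a result page is generated dynamically, and the results are returned to the user, who thereby obtains data from the back-end database. However, the number of tuples k returned by the system is limited: each query returns only a fixed number of tuples, so the user can obtain all tuples in the back-end database only by repeatedly refining the query conditions and issuing a number of queries. The data in such back-end databases is usually called a web hidden database (Web Hidden Database). The purpose of crawling a hidden database is to analyze, integrate, and mine the acquired data and to provide related value-added services.

Related research in this area mainly includes the following. Reference [4] studies a template-based data extraction algorithm that obtains information from the co-occurrence patterns of HTML strings. References [5,6] propose deep web data integration systems, focusing on integration schemas. Reference [7] studies how to find more entity records when crawling web databases. References [10,11] study an entity extraction mechanism based on a hierarchical tree, which addresses the entity extraction problem in the deep web environment.

This paper proposes a rank-shrink algorithm suited to numerical attributes. The algorithm improves on the binary-shrink algorithm: it raises query efficiency, reduces the number of queries, and lowers query cost. The experiments also verify that the rank-shrink algorithm outperforms the binary-shrink algorithm.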
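To make the top-k constraint described in the introduction concrete, the following is a minimal Python sketch of crawling a hidden database through a range query on one numerical attribute. It is an illustration under stated assumptions, not the paper's binary-shrink or rank-shrink algorithm: the names `make_top_k_interface` and `crawl_by_range_splitting` are hypothetical, the attribute is assumed to be integer-valued, and the simulated interface is assumed to report whether a query matched more than k tuples.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, str]  # (numerical attribute value, payload); hypothetical schema
Query = Callable[[int, int], Tuple[List[Row], bool]]


def make_top_k_interface(rows: List[Row], k: int) -> Query:
    """Simulate a web query interface that returns at most k tuples.

    Assumption: the interface also signals overflow, i.e. whether more
    than k tuples matched the range condition lo <= value <= hi.
    """
    def query(lo: int, hi: int) -> Tuple[List[Row], bool]:
        matches = [r for r in rows if lo <= r[0] <= hi]
        return matches[:k], len(matches) > k
    return query


def crawl_by_range_splitting(query: Query, lo: int, hi: int) -> Tuple[List[Row], int]:
    """Collect all tuples with value in [lo, hi] by halving overflowing ranges.

    Returns the collected tuples and the number of queries issued.
    """
    collected: List[Row] = []
    issued = 0
    pending = [(lo, hi)]
    while pending:
        a, b = pending.pop()
        rows, overflow = query(a, b)
        issued += 1
        if not overflow:
            # The whole range fit into one result page.
            collected.extend(rows)
        elif a == b:
            # More than k tuples share one value: the range cannot be
            # split further, so this sketch keeps only the k returned.
            collected.extend(rows)
        else:
            mid = (a + b) // 2
            pending.append((a, mid))
            pending.append((mid + 1, b))
    return collected, issued


# Usage: 100 tuples with distinct values 0..99, at most 10 tuples per query.
database = [(v, f"tuple-{v}") for v in range(100)]
interface = make_top_k_interface(database, k=10)
found, n_queries = crawl_by_range_splitting(interface, 0, 99)
print(len(found), "tuples retrieved with", n_queries, "queries")
```

The number of queries issued is the cost that matters in this setting, which is why the sketch returns it alongside the tuples. Halving the range blindly ignores the values of the tuples already returned, so a crawler that uses those values to choose its split points could plausibly need fewer queries.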
