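[1] Institute of Information Management, Fuzhou University. Research on Campus Data Simplification & Integration System Paradigm Management [J]. 信息化理论与实践, 2017, (01): 106-149.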
Research on Campus Data Simplification & Integration System Paradigm Management
《信息化理论与实践》[ISSN:2520-5862/CN:]
- Issue: 2017, No. 01
- Pages: 106-149
- Publication date: 2017-07-27
Article Info
- Author(s): Institute of Information Management, Fuzhou University (福州大学信息管理研究所)
- Title: Research on Campus Data Simplification & Integration System Paradigm Management
- Keywords: Data simplification; Data integration; Sharing pool
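- Abstract (translated from the Chinese):
University data integration is a core component of current campus information-sharing projects. Before integration, the heterogeneous data contain a large amount of redundancy: in databases it appears as duplicate fields, and in storage systems as duplicate files and duplicate backup data. Duplicate data degrade data quality, which in turn lowers the accuracy of information-based decisions and raises costs. This paper takes as its research objects the attribute-level and record-level duplicate data in the databases of university information systems, the file-level duplicate data in the storage subsystem, and the duplicate data in the backup system, and it proposes and studies a set of solutions.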
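After analyzing in depth the data distribution and data structures of our university, and after studying similarity algorithms, clustering algorithms, SOM networks, BP neural networks, database principles, hash functions, Rabin fingerprints, Bloom filters, and C++, this project establishes a campus data simplification and integration system paradigm on top of the existing heterogeneous database sources and storage devices of the university's departments.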
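For databases without primary keys: some historical university data have no primary key, so records within a table may be duplicated, and these duplicate records need to be simplified. The steps are preprocessing, field similarity calculation, duplicate record detection, and duplicate record handling.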
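For databases with primary keys, the record simplification paradigm is divided into an injection process and a simplification process. The injection process uses an SOM network to match field types and, on that basis, a BP network to match similar fields; together, SOM and BP map the fields of the heterogeneous databases to the reserved fields of the sharing pool. After the heterogeneous database data are injected into the sharing pool, anomalous records are detected with different strategies depending on how the fields correspond between tables. These anomalous records are then passed to manual screening, and the screening results are consolidated into a single complete table.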
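For redundant data in the storage subsystem, the work builds on hash algorithms, Rabin fingerprints, and Bloom filters. Because redundant data in university storage systems occur at two granularities, file level and byte level, two paradigm models are defined: file-level duplicate data simplification across information systems, and backup data simplification.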
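Finally, experiments verify the recall and precision of the database simplification paradigm and the compression ratio of the storage subsystem simplification.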
- Abstract:
Campus data integration is the core component of current campus information-sharing projects. Before integration, heterogeneous data contain a large amount of redundancy. In databases, this redundant data exists as duplicate fields; in storage systems, it exists as duplicate files. Duplicate data directly degrades data quality, which in turn reduces the accuracy of information-based decisions and increases cost. Taking attribute-level and record-level duplicate data in the DBMS, file-level duplicate data in the storage subsystem, and duplicate data in the backup system as its research objects, this paper proposes a set of solutions and studies them.
After thoroughly analyzing the data distribution and data structure, and studying similarity algorithms, clustering algorithms, SOM networks, BP neural networks, database principles, hash functions, Rabin fingerprints, Bloom filters, and C++, this project established the campus data simplification & integration paradigm on the basis of the heterogeneous database sources and storage devices distributed across the departments.
For databases without primary keys, some historical campus data contain no primary key, so tables may hold repeated records that need to be simplified. The simplification steps are: preprocessing, field similarity calculation, duplicate record detection, and treatment of duplicate records.
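The four steps above can be illustrated with a minimal Python sketch. It is not the paper's method: the abstract does not specify the similarity measure, so `difflib.SequenceMatcher`, the unweighted field average, the 0.9 threshold, and the "keep the first record" treatment are all assumptions chosen for illustration, as are the function names and sample rows.

```python
from difflib import SequenceMatcher


def preprocess(record):
    """Step 1: normalize each field (stringify, collapse whitespace, lowercase)."""
    return tuple(" ".join(str(value).split()).lower() for value in record)


def field_similarity(a, b):
    """Step 2: similarity of two field values in [0, 1] (difflib ratio, an assumed choice)."""
    if a == b:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(r1, r2):
    """Unweighted mean of the per-field similarities (assumed aggregation)."""
    if not r1:
        return 1.0
    return sum(field_similarity(a, b) for a, b in zip(r1, r2)) / len(r1)


def deduplicate(records, threshold=0.9):
    """Steps 3 and 4: detect near-duplicates and keep the first record of each group.

    Pairwise comparison is O(n^2), which is acceptable for a sketch only.
    """
    kept, kept_clean = [], []
    for record in records:
        clean = preprocess(record)
        if any(record_similarity(clean, other) >= threshold for other in kept_clean):
            continue  # treated as a duplicate of an earlier record
        kept.append(record)
        kept_clean.append(clean)
    return kept


# Hypothetical rows from a table with no primary key.
rows = [
    ("Zhang Wei", "Computer Science", "2015"),
    ("zhang  wei", "Computer Science ", "2015"),
    ("Li Na", "Mathematics", "2016"),
]
unique_rows = deduplicate(rows)
```

Here the second row differs from the first only in case and spacing, so it is dropped after preprocessing, while the third row is kept. A production system would likely weight fields differently and use blocking or clustering to avoid comparing every pair.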
For databases with primary keys, the procedure is divided into an injection process and a simplification process. In the injection process, field types are matched by an SOM network, and on that basis similar fields are matched by a BP neural network. Through the combination of the SOM and BP neural networks, the fields of the heterogeneous databases are mapped to the corresponding fields of the sharing pool. After the heterogeneous database data are injected into the sharing pool, abnormal records are detected in different ways depending on how the fields match. These abnormal records are finally passed to manual screening, and the results are integrated into a single new table.
For redundant data in the storage subsystem, the work builds on hash algorithms, Rabin fingerprints, and Bloom filters. Because redundant data in campus storage systems occur at two granularities, file level and byte level, two paradigms are put forward: file-level simplification across information systems and backup data simplification.
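As a rough illustration of the file-level case, the sketch below fingerprints whole files with SHA-256 and uses a small Bloom filter as a fast pre-check before consulting an exact index. The class and function names, the filter size, the number of hash functions, and the sample files are assumptions made for this example; the abstract does not describe the paper's actual design, and byte-level chunking with Rabin fingerprints is not shown.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: bytes):
        # Derive k bit positions from one SHA-256 digest by double hashing.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def deduplicate_files(files):
    """Map each file name to the name of the stored copy that holds its content.

    `files` is an iterable of (name, content_bytes) pairs.
    """
    seen = BloomFilter()
    stored = {}     # fingerprint -> name of the first file with that content
    placement = {}  # file name -> name of the stored copy
    for name, content in files:
        fingerprint = hashlib.sha256(content).digest()
        # Consult the exact index only when the filter reports a possible hit.
        if seen.might_contain(fingerprint) and fingerprint in stored:
            placement[name] = stored[fingerprint]
        else:
            seen.add(fingerprint)
            stored[fingerprint] = name
            placement[name] = name
    return placement


# Hypothetical files held by two different information systems.
placement = deduplicate_files([
    ("a/syllabus.pdf", b"course syllabus"),
    ("b/syllabus_copy.pdf", b"course syllabus"),
    ("c/roster.csv", b"student roster"),
])
```

The Bloom filter only rules out files that are definitely new; a positive answer is always confirmed against the exact fingerprint index, so a false positive costs one extra lookup rather than a lost file.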
Finally, experiments verified the recall and precision of the database simplification system and the compression ratio of the storage simplification system.
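For reference, the three evaluation measures named above can be computed as in the following sketch. Precision and recall follow their standard definitions; the compression ratio is taken here as original size divided by stored size, which is an assumption because the abstract does not give the paper's definition. The function names and sample values are illustrative only and are not results from the paper.

```python
def precision_recall(detected, actual):
    """Precision and recall of detected duplicates against the true duplicates."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def compression_ratio(original_bytes, stored_bytes):
    """Original size divided by size after deduplication (assumed definition)."""
    return original_bytes / stored_bytes


# Made-up example: duplicate pairs identified by record IDs.
precision, recall = precision_recall(
    detected={(1, 2), (3, 4), (5, 6)},
    actual={(1, 2), (3, 4), (7, 8), (9, 10)},
)
ratio = compression_ratio(original_bytes=1000, stored_bytes=250)
```

In this made-up case two of the three detected pairs are true duplicates, and two of the four true duplicates were found.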
Last Update:
2019-03-01