CN111459908A - A data lake-based multi-source heterogeneous ecological environment big data processing method and system - Google Patents
Tue Jul 28 2020
Technical Field
The present invention relates to the field of computer big data processing, and in particular to a data lake-based method and system for processing multi-source heterogeneous big data.
Background
In recent years, with the rapid development of technologies such as the Internet of Things, remote sensing, cloud computing, and mobile smart devices, ecological environment data has grown explosively. Overall, ecological environment big data can be divided into four categories: 1) basic supporting data: basic geography, remote sensing imagery, climate and meteorological data, etc.; 2) natural ecological data: data on farmland, forest, grassland, desert, and swamp ecosystems, etc.; 3) environmental monitoring data: data on the water environment, atmospheric environment, soil environment, noise environment, nuclear radiation environment, etc.; 4) human and social data: data on economic development, infrastructure, energy consumption, public participation, online public opinion, etc.
Ecological environment big data has its own unique characteristics. 1) It involves a huge volume of "integrated space-air-ground" data: in terms of scale, the amount of ecological environment data has jumped from the TB level to the PB level. 2) Its types, sources, and formats are complex and diverse: in content, ecological environment data covers water, soil, the atmosphere, and other aspects; geographically, it covers ecosystems at all global scales, such as oceans, forests, and wetlands; in terms of sources, it comes from different departments such as meteorology, water resources, land, agriculture, forestry, transportation, and socio-economics; in terms of structure, it includes structured and semi-structured government statistics, unstructured environmental text data, binary remote sensing satellite imagery, and data of other structures. 3) It has strong spatial and geographic characteristics and higher requirements for real-time performance and spatial accuracy, as with natural disaster information, environmental pollution status, and traffic congestion, without exception. Therefore, general big data processing and monitoring methods cannot meet the requirements for computing response speed, scalability, and flexibility in the use of ecological environment big data.
At present, there is no processing method that can support multi-source heterogeneous ecological environment big data and solve the problem of sharing such data. The main difficulties are: 1) data interoperability: the sources of ecological environment big data cover almost all government functional departments, these departments are not connected to each other, and the data often exists as "data islands"; 2) data normalization: data exists not only in structured form but even more often in semi-structured and unstructured forms, there is no unified data specification, and a large amount of heterogeneous data exists; 3) storage cost and runtime performance: storing ecological environment big data in databases or data warehouses often incurs high storage costs and severely restricts the speed of data processing; 4) data openness: the total amount of open ecological environment data is low, most of it is static, and it is concentrated in cities with developed economies, strong government informatization foundations, and well-developed IT industries.
In summary, there is an urgent need for a data lake-based multi-source heterogeneous ecological environment big data processing method that standardizes ecological environment big data and promotes the integrated processing and monitoring of similar data.
Summary of the Invention
In view of this, the present invention provides the following technical solutions.
In one aspect, the present invention provides a data lake-based multi-source heterogeneous ecological environment big data processing system, the system comprising:
an ecological environment data collection layer, an ecological environment data cleaning layer, an ecological environment data storage layer, an ecological environment data processing layer, and an ecological environment data management layer;
the ecological environment data collection layer is used to collect raw ecological environment data, the raw data including ecological environment data and ecological environment metadata; the ecological environment data collection layer includes a metadata collection module and a data collection module, the metadata collection module being used to collect ecological environment metadata of various sources and structures, and the data collection module being used to collect ecological environment data of various sources and structures;
the ecological environment data cleaning layer is used to preprocess and standardize the data obtained by the ecological environment data collection layer;
the ecological environment data storage layer is used to store the data transmitted by the ecological environment data cleaning layer in classes and layers; the ecological environment data storage layer includes a data classification storage module and a data layered storage module, the data classification storage module being used to store the standardized data transmitted by the ecological environment data cleaning layer into classified data pools of different categories according to ecological environment data category, and the data layered storage module being used to layer the data in the classified data pools according to access frequency;
the ecological environment data processing layer is used for integrated processing of ecological environment stream and batch data;
the ecological environment data management layer is used to monitor the ecological environment data collection, cleaning, storage, and processing processes of the ecological environment data collection layer, the ecological environment data cleaning layer, the ecological environment data storage layer, and the ecological environment data processing layer.
Preferably, the ecological environment data cleaning layer includes a data missing value processing module, a data consistency check module, and a data standardization processing module;
the data missing value processing module processes the missing data in the raw data based on the importance of the data;
the data consistency check module is used to check the data processed by the data missing value processing module and to generate a check result;
the data standardization processing module is used to standardize the data checked by the data consistency check module, so as to realize the mapping of ecological environment data.
Preferably, the data standardization processing module constructs an ecological environment data standard based on the subject label classification and data characteristics of the data, converts the data checked by the data consistency check module into a unified data format according to the ecological environment data standard, and stores it in the corresponding classified data pool, so as to realize the mapping of ecological environment data.
Preferably, the data layered storage module divides the data in the classified data pools into cold data and hot data according to access frequency, the hot data including permanent hot data and periodic hot data; the permanent hot data is stored long-term in a PostgreSQL database; when the access frequency of the periodic hot data reaches or exceeds a preset threshold, it is migrated to the PostgreSQL database, and when the access frequency of the periodic hot data falls below the preset threshold, the data is migrated back to the data lake; the cold data is stored in the data lake.
Preferably, the environmental data management layer records log information of each execution process of the ecological environment data collection layer, the ecological environment data cleaning layer, the ecological environment data storage layer, and the ecological environment data processing layer, as well as external access log information.
In addition, the present invention also provides a data lake-based multi-source heterogeneous ecological environment big data processing method, the method comprising:
S1: the ecological environment data collection module collects raw ecological environment data and stores it in a raw data pool in the data lake, the raw data including ecological environment data and ecological environment metadata;
S2: cleaning the raw data, including missing data processing and consistency checking, forming standardized environmental data based on the cleaned data and environmental data standardization rules, and storing it in a standardized data pool;
S3: constructing ecological environment data classification data pools according to ecological environment data classification rules, extracting the standardized environmental data from the standardized data pool described in S2, and classifying and storing it in the ecological environment data classification data pools;
S4: setting an access frequency threshold and constructing, based on the threshold, a hot data layer and a cold data layer in each ecological environment data classification data pool; when the access frequency reaches or exceeds the threshold, the accessed data is migrated to the hot data layer; otherwise it is not migrated and remains stored in the cold data layer; permanent hot data is stored long-term in the hot data layer;
S5: processing the ecological environment stream data and the ecological environment batch data in an integrated manner, and retrieving the required data according to external query requirements.
Preferably, in S2, the consistency check is performed on the data after missing data processing, the check results are recorded, and a data list is generated for inconsistencies; the consistency check includes a logical reasonableness check and an out-of-range check.
Preferably, in S2, generating the standardized environmental data specifically includes: constructing an environmental data standard according to the subject labels and data characteristics of the cleaned data, mapping the cleaned data according to the environmental data standard, and converting it into a unified format; the environmental data standard includes a terminology standard, a domain standard, and a code standard.
Preferably, in S3, if the standardized environmental data fails to be stored in an ecological environment data classification data pool, failure information is sent to the raw data pool, the corresponding raw data is re-standardized, and it is stored again in the ecological environment data classification data pool;
the ecological environment data classification data pools include a basic support data pool, a natural ecological data pool, an environmental monitoring data pool, and a humanities and social data pool.
Preferably, S4 further includes: the hot data includes permanent hot data, periodic hot data, and burst hot data;
the periodic hot data is migrated back to the cold data layer after its frequent-access cycle ends; when the access frequency of the burst hot data reaches the threshold, the burst hot data is migrated to the PostgreSQL database.
Compared with the prior art, the technical solution of the present invention builds an ecological environment data lake on a public cloud, collects multi-source heterogeneous "integrated space-air-ground" raw ecological environment data and metadata, converts the raw data into standardized data through data cleaning steps such as missing value processing, consistency checking, and standardization, and designs and builds data classification and layered storage mechanisms. Combined with the preset classification and layered storage mechanisms, the standardized data is stored by category into different classified data pools, and within each classified data pool the standardized data is stored into data layers of different levels according to its access heat. This normalizes ecological environment data, realizes interconnection between environmental data, enables effective monitoring and use of environmental data, improves the efficiency of environmental data access and analysis, greatly reduces storage costs, and provides data support for management and construction in the ecological environment field.
Description of the Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of the architecture of a data lake-based multi-source heterogeneous ecological environment big data management system according to an embodiment of the present invention;
FIG. 2 is a flowchart of a preferred embodiment of the data lake-based multi-source heterogeneous ecological environment big data management method according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
Those skilled in the art should understand that the following specific embodiments or implementations are a series of optimized arrangements listed by the present invention to further explain the content of the invention, and that these arrangements can be combined with or used in association with one another, unless the present invention explicitly states that a certain embodiment or implementation cannot be combined or used together with other embodiments or implementations. Meanwhile, the following specific embodiments or implementations are only optimal arrangements and are not to be understood as limiting the protection scope of the present invention.
Embodiment 1:
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
FIG. 1 is a schematic diagram of the architecture of the data lake-based multi-source heterogeneous ecological environment big data integration system of the present invention.
The present invention aims to provide a data lake-based multi-source heterogeneous ecological environment big data integration system, including an ecological environment data collection layer L1, an ecological environment data cleaning layer L2, an ecological environment data storage layer L3, an ecological environment data processing layer L4, and an ecological environment data management layer L5.
The ecological environment data collection layer L1 consists of a metadata collection module L11 and a data collection module L12.
The metadata collection module L11 is used to collect ecological environment metadata of various sources and structures.
The data collection module L12 is used to collect ecological environment data of various sources and structures.
Specifically, public cloud Function Compute technology is combined with local periodic script execution to automatically acquire ecological environment stream data and batch data. Program code is uploaded to the public cloud, and Function Compute, together with public cloud stateful containers and public cloud stateless containers, collects ecological environment stream data into the data lake in an event-driven, fully managed, automatic manner; an ecological environment batch data collection script is compiled and executed periodically to automatically collect ecological environment batch data into the raw data pool of the data lake. Automatic acquisition of ecological environment metadata is supported, as are real-time full and incremental updates of ecological environment data.
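As a rough illustration of the event-driven ingestion described above, the following is a minimal sketch of a Function Compute-style handler that lands one incoming monitoring record in an object-storage-backed data lake. The `handler(event, context)` signature, the bucket name, and the `put_object` stand-in are assumptions made for illustration, not details fixed by the patent.

```python
import json
import time


def put_object(bucket, key, body):
    # Stand-in for the object-storage SDK call (e.g. an OSS/S3 client write);
    # the real call depends on the public cloud provider and is not specified here.
    print(f"would write {len(body)} bytes to {bucket}/{key}")


def handler(event, context):
    """Event-driven ingestion: one invocation per pushed monitoring record."""
    record = json.loads(event)          # e.g. {"city": "Beijing", "pm25": 35, ...}
    ts = time.strftime("%Y/%m/%d/%H%M%S")
    # Land the record unchanged in the raw data pool of the data lake.
    key = f"raw-data-pool/air-quality/{ts}-{record.get('city', 'unknown')}.json"
    put_object("eco-env-data-lake", key, json.dumps(record, ensure_ascii=False))
    return {"stored_key": key}
```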
The metadata is a group of information or data used to describe data. For example, the city name, air quality index, average hourly CO value, average hourly NO2 value, average hourly O3 value, and so on describe a set of urban atmospheric monitoring data, i.e., they constitute the metadata of the urban atmospheric monitoring data. The ecological environment metadata is used to build an ecological environment data retrieval system, enabling rapid location and acquisition of data. At the same time, based on the ecological environment metadata, quality assessment and data fusion of newly collected multi-source heterogeneous ecological environment data are realized.
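For concreteness, a metadata record for an urban atmospheric monitoring dataset of the kind mentioned above might look like the following sketch; the field names and values are illustrative assumptions.

```python
# A hypothetical metadata record describing one urban atmospheric monitoring dataset.
urban_air_metadata = {
    "dataset": "urban_atmospheric_monitoring",
    "city_name": "Beijing",
    "fields": ["aqi", "co_avg_hourly", "no2_avg_hourly", "o3_avg_hourly"],
    "units": {"co_avg_hourly": "mg/m3", "no2_avg_hourly": "ug/m3", "o3_avg_hourly": "ug/m3"},
    "source": "municipal monitoring stations",
    "update_frequency": "hourly",
}
```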
The ecological environment data cleaning layer L2 includes a data missing value processing module L21, a data consistency check module L22, and a data standardization processing module L23.
The data missing value processing module L21 is used to process missing data in the raw data automatically acquired by the ecological environment data collection layer L1.
Specifically, for different categories of ecological environment data, the data missing value processing module L21 selects, according to the importance of the data, direct deletion (including listwise deletion, variable deletion, pairwise deletion, etc.), automatic filling (including global constant filling, centrality filling, within-group mean filling, most-likely-value filling, etc.), or manual filling.
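A minimal sketch of two of these strategies, using pandas purely for illustration (the patent does not prescribe a library): dropping records whose identifying variable is missing, and filling a missing temperature with the within-group (per-city) mean.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Beijing", "Beijing", "Shanghai", "Shanghai"],
    "temperature": [28.0, None, 31.0, 30.0],   # one missing observation
    "station_id": ["BJ01", "BJ02", None, "SH02"],
})

# Direct deletion: drop records whose identifying variable is missing.
df = df.dropna(subset=["station_id"])

# Automatic filling: replace a missing value with the same-group (city) mean.
df["temperature"] = df["temperature"].fillna(
    df.groupby("city")["temperature"].transform("mean")
)
```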
The data consistency check module L22 is used to check whether the ecological environment data processed by the data missing value processing module L21 meets the requirements.
Specifically, data specifications are designed according to the ecological environment data subject labels in combination with the multi-source heterogeneous ecological environment data. Based on the designed data specifications, the data consistency check module L22 uses ETL techniques to judge whether the data has problems such as field overload, logical unreasonableness, or values outside a given range, completes the ecological environment data consistency check, records the check results, and, for the inconsistencies found, produces a data list for further verification and correction.
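A simplified sketch of such a rule-based range and reasonableness check, producing a discrepancy list for later correction; the rules and field names are illustrative assumptions.

```python
def check_consistency(records):
    """Flag records that violate simple range/logic rules and return them as a list."""
    issues = []
    for rec in records:
        if not -90.0 <= rec.get("temperature", 0.0) <= 60.0:
            issues.append({"record": rec, "problem": "temperature out of range"})
        if rec.get("aqi", 0) < 0:
            issues.append({"record": rec, "problem": "negative air quality index"})
        if rec.get("month") in (7, 8) and rec.get("temperature", 0.0) < 0.0:
            # Logical-reasonableness rule: a sub-zero July/August reading is implausible.
            issues.append({"record": rec, "problem": "implausible summer temperature"})
    return issues
```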
The data standardization processing module L23 is used for the standardization of the multi-source heterogeneous ecological environment data checked by the data consistency check module L22, so as to realize multi-source heterogeneous ecological environment data mapping.
Specifically, according to the subject label classification and data characteristics of the ecological environment data, the data standardization processing module L23 constructs ecological environment data standards: 1) terminology standard: defines standard data terms in the ecological environment field to keep the terminology of multi-source heterogeneous data consistent and improve management efficiency, for example a standard for column names stored in the data lake/database (length limit, no duplicates, unified naming rules, use of standard environmental terms, etc.); if "city name" is defined as the standard term for a column name, then variants such as "name of the city" and "city's name" are excluded from column name terms; 2) domain standard: defines the nature of columns, dividing them into four types (text, numeric, date, and time) and further into city name, longitude and latitude, index, ratio, etc., so that columns of related types can be managed uniformly; 3) code standard: defines code values for special domains to lay the foundation for subsequent high-performance, efficient data mapping, for example setting a global city id to enable rapid mapping between different categories of ecological environment data for the same city.
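One minimal way to apply such terminology and code standards is a dictionary-based column rename plus a city-id lookup, as sketched below; the mapping tables and field names are illustrative assumptions rather than part of the patent.

```python
# Terminology standard: map heterogeneous source column names onto the standard term.
COLUMN_TERMS = {"City Name": "city_name", "city": "city_name", "cityName": "city_name"}

# Code standard: a hypothetical global city id table used for cross-dataset mapping.
CITY_CODES = {"Beijing": 110000, "Shanghai": 310000}


def standardize_columns(record):
    """Rename columns to standard terms and attach the standard city code."""
    out = {COLUMN_TERMS.get(k, k): v for k, v in record.items()}
    out["city_id"] = CITY_CODES.get(out.get("city_name"))
    return out
```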
Combined with the data standards, the multi-source heterogeneous ecological environment data is mapped, converted into unified and easy-to-use standardized data, and stored in the standardized data pool. For example, dispersion standardization (formula (1)) is used to linearly transform some environmental monitoring values, mapping them into the interval [0, 1] so that monitoring results obtained in different cities can be compared uniformly. This ultimately solves the "information island" problem of ecological environment data and makes data query and analysis more convenient.
The sequence $x_1, x_2, \ldots, x_n$ is transformed according to

$$y_i = \frac{x_i - \min_{1 \le j \le n} x_j}{\max_{1 \le j \le n} x_j - \min_{1 \le j \le n} x_j}, \qquad i = 1, 2, \ldots, n \tag{1}$$

so that the new sequence $y_1, y_2, \ldots, y_n \in [0, 1]$ is dimensionless.
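For completeness, formula (1) in code form; this small helper is an illustration only and is not taken from the patent text.

```python
def min_max_normalize(values):
    """Dispersion standardization per formula (1): map values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # degenerate case: constant sequence
    return [(v - lo) / (hi - lo) for v in values]


# Example: monitoring readings from different cities become directly comparable after scaling.
print(min_max_normalize([35, 80, 160, 220]))
```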
The ecological environment data storage layer L3 includes a data classification storage module L31 and a data layered storage module L32.
Specifically, data lake technology is used to store, by class and by layer, ecological environment data of all data types, including structured ecological environment data (relational database data), semi-structured ecological environment data (data in CSV, XML, JSON, and other formats), unstructured ecological environment data (document-type data), and binary ecological environment data (image, audio, video, and other formats). This solves the problems caused by storing multi-source heterogeneous ecological environment data and, to a large extent, addresses query response time and storage cost.
The data classification storage module L31 is used to store the standardized data from the data cleaning layer L2 into data pools of different categories according to ecological environment data category.
Specifically, in combination with the ecological environment data subject classification labels, the data classification storage module L31 stores the standardized ecological environment data into the corresponding classified data pool according to its category, based on the data source or using one or more of a support vector machine algorithm and a random-forest-based co-training algorithm. The classified data pools include a basic support data pool, a natural ecological data pool, an environmental monitoring data pool, and a humanities and social data pool. In each classified data pool, a large amount of metadata is associated with specific objects, which improves data retrieval efficiency, solves the problems caused by storing massive data in all its forms, and provides higher availability, fault tolerance, and scalability.
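The sketch below shows one way such category routing could be trained with a support vector machine over simple dataset descriptions, using scikit-learn; the features, labels, and choice of scikit-learn are assumptions made for illustration, not details from the patent.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative training set: dataset descriptions -> target classified data pool.
descriptions = [
    "remote sensing image tiles with geographic reference",
    "hourly pm2.5 and no2 readings from monitoring stations",
    "forest and wetland ecosystem survey plots",
    "regional gdp, energy consumption and public opinion statistics",
]
pools = ["basic_support", "environmental_monitoring", "natural_ecology", "humanities_social"]

classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(descriptions, pools)

print(classifier.predict(["daily o3 concentration measured by city stations"]))
```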
The data layered storage module L32 is used to layer the data in the classified data pools of the data classification storage module L31 according to access frequency.
Specifically, in combination with a solid-state drive (SSD) hybrid storage mechanism, hot data is set to account for 20% of the data in the corresponding classified data pool, which defines the frequency threshold. According to the frequency threshold, the data layered storage module L32 stores the data in the classified data pools formed by the data classification storage module L31 into the cold data storage layer L321 or the hot data storage layer L322. Within a set unit of time, when the access frequency of data exceeds the preset frequency threshold, migration of that data to the hot data storage layer L322 is triggered; otherwise no migration is triggered and the data remains stored in the cold data storage layer L321.
The cold data storage layer L321 is used to store ecological environment data whose access frequency is below the frequency threshold.
The hot data storage layer L322 is used to store ecological environment data whose access frequency exceeds the frequency threshold, including permanent hot data, periodic hot data, and burst hot data.
Specifically, in combination with PostgreSQL database technology, the hot data storage layer L322 migrates hot data into the PostgreSQL database; within the threshold trigger period, accessing the hot data again does not require calling the data lake, and the query is made directly in the PostgreSQL database. Permanent hot data is stored long-term in the PostgreSQL database and is not migrated back to the data lake cold data storage layer L321. When periodic hot data is within its frequently accessed cycle, the data is stored in the PostgreSQL database, and when the cycle ends, the data is migrated back to the data lake cold data storage layer L321. When burst hot data is frequently accessed, data migration is triggered and the data is stored in the PostgreSQL database for a short period; when its access frequency falls below the access threshold, it is migrated back to the data lake cold data storage layer L321.
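A simplified sketch of the tiering decision described above, evaluated once per unit time window; the access-counter structure and the migration helpers are hypothetical placeholders, and the concrete logic is only illustrative.

```python
def migrate_to_postgres(key):
    # Placeholder: copy the object from the data lake into a PostgreSQL hot-data table.
    pass


def migrate_back_to_lake(key):
    # Placeholder: move the rows back out of PostgreSQL into the data lake cold layer.
    pass


def retier(access_counts, hot_keys, threshold, permanent_keys):
    """Re-evaluate hot/cold placement for one classified data pool."""
    for key, count in access_counts.items():
        if count >= threshold and key not in hot_keys:
            migrate_to_postgres(key)        # promote: cold layer -> PostgreSQL hot layer
            hot_keys.add(key)
    for key in list(hot_keys):
        if key in permanent_keys:
            continue                        # permanent hot data never moves back
        if access_counts.get(key, 0) < threshold:
            migrate_back_to_lake(key)       # demote periodic/burst hot data after its cycle
            hot_keys.discard(key)
```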
Compared with existing hot/cold storage techniques, the present invention stores the multi-source heterogeneous raw ecological environment data and the cold data in the data lake, with a fully serverless structure, no long-term holding cost, and completely pay-as-you-go billing, which enhances the flexibility of ecological environment data storage and the convenience of resource scaling. Hot ecological environment data is stored in a PostgreSQL database; the powerful, open-source, and free PostgreSQL database can interact with the data lake directly using SQL and supports highly concurrent read and write operations, providing efficient and timely read/write performance for hot data storage. At the same time, PostgreSQL has rich data types such as dictionaries, arrays, and bitmaps, and its spatial database extension PostGIS provides spatial information services such as spatial objects, spatial indexes, spatial operation functions, and spatial operators, which are suitable for expressing and managing "integrated space-air-ground" ecological environment data. Therefore, by combining the data lake with PostgreSQL to build the ecological environment big data layered storage module, data access efficiency is greatly improved and data storage costs are reduced.
The ecological environment data processing layer L4 includes a data stream-batch integrated processing module L41, which is used for integrated processing of stream and batch environmental data.
Specifically, based on Flink, the Apache open-source distributed stream processing framework, the data stream-batch integrated processing module L41 uses a single big data processing engine to write one body of code that processes, in an integrated manner, stream data from multiple data sources (HDFS, the local file system, the MapReduce file system, plain text files, etc.), such as comprehensive sensing and real-time monitoring data for environmental elements including the atmosphere, water, soil, ecology, and nuclear radiation and for various pollution sources, together with batch data, such as natural ecological data on key ecological function areas, nature reserves, and biodiversity conservation priority areas and pollution discharge data by region, river basin, and industry. Using Flink's built-in ANSI-standard SQL interface, functions are customized according to the analysis characteristics of ecological environment stream and batch big data, and customized code is executed in SQL. By connecting event streams, stream data such as various kinds of environmental monitoring data is fused with batch data such as humanities and social data, the data is analyzed in real time, and results are continuously produced and updated as events are consumed, achieving second-level or even sub-second-level latency and ensuring the real-time performance, stability, and shareability of ecological environment data.
Flink's built-in ANSI-standard SQL interface is used to call the Flink DataStream API to process ecological environment stream data; combined with stream processing mechanisms such as the checkpoint and state mechanisms, the watermark mechanism, and windows and triggers, ecological environment stream data is aggregated and processed and moved continuously between applications and systems. The Flink DataSet API is called to treat ecological environment batch data as a bounded stream: processing starts at one point in time and ends at another, the input data set does not grow over time, and the computation result is produced only once at the end. Combined with batch processing mechanisms such as backtracking for scheduling and recovery, special in-memory data structures for hashing and sorting, and the optimizer, data is not written to disk during processing, and part of the data is spilled from memory to disk only when necessary, so ecological environment batch data is processed efficiently. In the end, integrated processing of ecological environment stream data and ecological environment batch data is realized, the latency of data calls by external platforms/systems is reduced to the second or even sub-second level, and high maintainability of external platform/system data processing programs is supported.
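A minimal sketch of this unified stream/batch idea using the PyFlink Table API; the connector, table schema, and query are illustrative assumptions, and the patent does not prescribe this particular API surface.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# The same SQL can run in streaming or batch mode; switching the settings changes the mode.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical source of real-time air-quality monitoring events (connector is an assumption).
t_env.execute_sql("""
    CREATE TABLE air_quality (
        city_id INT,
        aqi     DOUBLE,
        ts      TIMESTAMP(3)
    ) WITH ('connector' = 'datagen')
""")

# Continuous aggregation: average AQI per city, updated as events are consumed.
result = t_env.execute_sql(
    "SELECT city_id, AVG(aqi) AS avg_aqi FROM air_quality GROUP BY city_id"
)
result.print()
```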
The environmental data management layer L5 is used to monitor the whole process of environmental data collection, cleaning, storage, and processing.
Specifically, the ecological environment data management layer L5 is activated when ecological environment data collection begins and records the log information of each step of ecological environment data collection, ecological environment data cleaning, ecological environment data storage, and ecological environment data processing. In addition, it also records the log information of external platforms/systems accessing and querying environmental data through the API, thereby addressing data quality and system performance issues throughout the integrated management of ecological environment data.
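A minimal sketch of per-step and external-access logging of this kind, using Python's standard logging module; the step names and log fields are illustrative assumptions.

```python
import logging

logging.basicConfig(
    filename="eco_env_pipeline.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("eco_env_management_layer")


def log_step(layer, step, status, detail=""):
    """Record one execution step of the collection/cleaning/storage/processing layers."""
    log.info("layer=%s step=%s status=%s detail=%s", layer, step, status, detail)


def log_external_access(client, api_path, rows_returned):
    """Record an external platform/system querying environmental data through the API."""
    log.info("external_access client=%s api=%s rows=%d", client, api_path, rows_returned)


log_step("L2", "consistency_check", "ok", "3 records flagged")
log_external_access("city-dashboard", "/api/air-quality", 1200)
```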
Embodiment 2:
FIG. 2 is a flowchart of a preferred embodiment of the data lake-based ecological environment big data management method of the present invention.
In this embodiment, the implementation of the data lake-based multi-source heterogeneous ecological environment big data management method includes steps S1 to S5. The implementation of the method is described in detail below in combination with specific ecological environment data.
Step S1: the ecological environment data collection layer L1 collects, in an event-driven, fully managed, automatic manner, raw ecological environment data of various structures from each data source; this data includes ecological environment data and metadata and is stored in the raw data pool in the data lake. The environmental data in the raw data pool is the raw ecological environment data collected from the data sources; no data processing is applied, and the original format and form of the data are kept, for example storing raw public opinion data on global environmental issues as text and storing meteorological data as two-dimensional tables.
Step S2: the ecological environment data cleaning layer L2 cleans the raw ecological environment data, including missing data processing and consistency checking, constructs standardization rules for the different categories of environmental data, forms standardized environmental data, and stores it in the standardized data pool. Specifically, in missing data processing, for different categories of ecological environment data and according to the importance of the data, direct deletion, automatic filling, or manual filling is selected; for example, if a temperature value is missing from the meteorological data, the average temperature over a continuous month can be used.
In the consistency check, based on the designed data specifications, ETL techniques are used to judge whether the data has problems such as field overload, logical unreasonableness, or values outside a given range; the ecological environment data consistency check is completed, the results are recorded, and for the inconsistencies found a data list is produced for further verification and correction. For example, if a temperature of -5 °C appears in the Beijing meteorological data for July-August, the data is judged to be problematic, should be recorded, and should be further corrected.
In data standardization, environmental data standards are constructed according to the ecological environment data subject label classification and data characteristics, including a terminology standard, a domain standard, and a code standard. Combined with the data standards, the multi-source heterogeneous ecological environment data is mapped and converted into unified, easy-to-use standardized data, which is stored in the standardized data pool; for example, for multi-source heterogeneous meteorological data, the meteorological data description fields are unified and the data is uniformly converted into ORC format and stored in the standardized data pool. This solves the "information island" problem of ecological environment data and improves the convenience of data query and analysis.
Step S3: the ecological environment data classification storage module L31 in the ecological environment data storage layer L3 constructs the ecological environment data classification data pools, designs ecological environment data classification rules, extracts the standardized environmental data from the standardized data pool, and classifies and stores it into the corresponding classified data pools. Specifically, the classified data pools include a basic support data pool, a natural ecological data pool, an environmental monitoring data pool, and a humanities and social data pool.
Further, if a certain category of ecological environment data is successfully stored in the corresponding classified data pool, the raw data pool receives a success notification; if the storage fails, the raw data pool receives a failure notification, and the raw data corresponding to that category of ecological environment data is re-standardized and stored again in the corresponding classified data pool. For example, public opinion data on environmental issues from GDELT belongs to humanities and social data and should be stored in the humanities and social data pool; if the data is successfully stored in the humanities and social data pool, the raw data pool receives a success notification; otherwise, the raw data corresponding to the data is re-standardized and stored in the corresponding classified data pool.
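The routing with success/failure feedback described above could be sketched as follows; the pool-storage call, the notification mechanism, and the single-retry behaviour are assumptions for illustration.

```python
def store_in_pool(pool_name, dataset):
    # Placeholder: write the standardized dataset into the named classified data pool.
    return True


def notify_raw_data_pool(dataset_id, message):
    # Placeholder: post a success/failure notification back to the raw data pool.
    print(f"{dataset_id}: {message}")


def route_dataset(dataset, pool_name, restandardize):
    """Store a standardized dataset into its classified pool, re-standardizing on failure."""
    if store_in_pool(pool_name, dataset):
        notify_raw_data_pool(dataset["id"], f"stored in {pool_name}")
        return
    notify_raw_data_pool(dataset["id"], "storage failed, re-standardizing")
    retried = restandardize(dataset["id"])      # re-run the S2 standardization step
    store_in_pool(pool_name, retried)
```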
Step S4: the data layered storage module L32 in the ecological environment data storage layer L3 constructs a hot data storage layer and a cold data storage layer in each classified data pool, designs an ecological environment data access frequency threshold, and stores the standardized ecological environment data in the classified data pools into the corresponding data layer according to its access heat. Specifically, in combination with a solid-state drive (SSD) hybrid storage mechanism, hot data is set to account for 20% of the data in the corresponding classified data pool, which defines the frequency threshold, and different frequency threshold trigger periods are set according to the access frequency of the different ecological environment classified data pools. Within the corresponding period, when the access frequency of data exceeds the preset frequency threshold, migration of that data to the hot data storage layer is triggered; otherwise no migration is triggered and the data remains stored in the cold data storage layer.
Further, in the hot data storage layer, hot data is migrated to the PostgreSQL database; within the threshold trigger period, accessing the hot data again does not require calling the data lake, and the query is made directly in the PostgreSQL database. Permanent hot data is stored long-term in the PostgreSQL database and is not migrated back to the cold data storage layer of the data lake. When periodic hot data is within its frequently accessed cycle, the data is stored in the PostgreSQL database, and when the cycle ends the data is migrated back to the cold data storage layer of the data lake. When burst hot data is frequently accessed, data migration is triggered and the data is stored in the PostgreSQL database for a short period; when the access ends, it is migrated back to the cold data storage layer of the data lake. For example, among 2 million meteorological records for 2018, the 40,000 most frequently accessed records are treated as hot data; these 40,000 records are statistically observed to determine which type of hot data they belong to, and data migration is carried out according to the data layering mechanism.
Step S5: the ecological environment data processing layer L4 processes ecological environment data in an integrated stream-batch manner according to the received data request, and queries and computes the required data according to business needs. Specifically, based on Flink, the Apache open-source distributed stream processing framework, ecological environment batch data is treated as bounded stream data and processed in the same way as stream data such as real-time monitoring data and real-time release data, realizing integrated processing of ecological environment stream data and ecological environment batch data. For example, Flink is applied to process real-time and offline demographic statistics in an integrated manner, reducing the latency of demographic data calls by external platforms/systems to the second or even sub-second level and supporting high maintainability of external platform/system data processing programs.
The data lake-based multi-source heterogeneous ecological environment big data management system provided in this embodiment includes an ecological environment data collection layer, an ecological environment data cleaning layer, an ecological environment data storage layer, an ecological environment data processing layer, and an ecological environment data management layer. The ecological environment data collection layer includes a metadata collection module and a data collection module; the metadata collection module is used to collect ecological environment metadata of various sources and structures, and the data collection module is used to collect ecological environment data of various sources and structures. The ecological environment data cleaning layer includes a data missing value processing module, a data consistency check module, and a data standardization processing module; the data missing value processing module is used to process invalid and missing data in the ecological environment data, the data consistency check module is used to check whether the ecological environment data meets the requirements, and the data standardization processing module is used for the standardization of multi-source heterogeneous data to realize multi-source heterogeneous ecological environment data mapping. The ecological environment data storage layer includes a data classification storage module and a data layered storage module; the data classification storage module is used to store data into the corresponding classified data pools according to ecological environment data category, and the data layered storage module is used to layer the data in the classified data pools according to access heat. The ecological environment data processing layer includes the data stream-batch integrated processing module, which is used for integrated processing of stream and batch environmental data. The ecological environment data management layer is used to monitor the whole process of environmental data collection, cleaning, storage, and processing. The present invention provides a data lake-based multi-source heterogeneous ecological environment big data management method that realizes interconnection between ecological environment data, enables effective management and use of ecological environment data, improves the efficiency of ecological environment data access and analysis, and greatly reduces storage costs.
Those of ordinary skill in the art can understand that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing relevant hardware; the program can be stored in a computer-readable storage medium, and when executed it can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions that can easily be conceived by a person skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.