Data Lake vs. Data Warehouse

Two concepts for industrial big data applications that could not be more different.

Jun 2018
Michael Welsch

What is Big Data?

Big data is about strategies for processing amounts of data that cannot be processed with conventional computers. The amounts involved are too large for a single machine: we are talking about terabytes or petabytes of information that no single computer can hold or access at once. For processing, the data is therefore spread across several processors, each with its own memory, in a data center. This setup requires specially adapted algorithms if one wants to evaluate all of the data against all of the other data. A data lake application requires a server infrastructure tailored to it; one brings the algorithms to the data, not the other way around. One should ask oneself, though, whether this flexibility is expedient for industrial production data, or whether one should rather perform a classical aggregation following the data warehouse concept.
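The idea of bringing the algorithm to the data can be sketched in a few lines. This is only an illustration under simple assumptions: the three lists stand in for partitions held on separate nodes, and the names `Partial`, `summarize` and `combine` are made up for this example. Each node reduces its own partition to a small summary, and only those summaries travel to a coordinator.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    """Small summary that one node computes over its own partition."""
    count: int = 0
    total: float = 0.0


def summarize(partition):
    # Runs where the data lives; only the summary leaves the node.
    return Partial(count=len(partition), total=sum(partition))


def combine(partials):
    # Runs on a coordinator and never sees the raw values.
    count = sum(p.count for p in partials)
    total = sum(p.total for p in partials)
    return total / count if count else float("nan")


# Hypothetical partitions, standing in for data held on three separate nodes.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
global_mean = combine([summarize(p) for p in partitions])
```

A mean is easy to split this way because count and sum can be merged in any order. Evaluations that compare every value with every other value do not decompose so simply, which is why the algorithms have to be adapted.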

Why should one even want to process this much data at the same time?

Algorithms combined with astronomical computational power and storage capacity are capable of something humans are not. Humans are unsurpassed at processing complex information and at deciding under uncertainty and with fragmentary information. Machines are excellent at processing uniform information, and at processing an incredible amount of it in a split second. The synergy of both worlds lies in humans dealing with the complexity, machines dealing with the uniform information, and humans then, thanks to the machines, deciding with less uncertainty.

Data Lake Strategy

With the data lake strategy, all data is initially gathered at one central point. This concept suggests itself if one wants to use all of the data already present in the plant. One commissions the reprogramming of the PLC so that it exposes its internal data on the bus system, and acquires additional modules, so-called IoT gateways, which route this data to a data center that also has to be bought or rented, where the data is stored in a suitable database. The challenge there lies in filtering out the non-correlating data, which normally makes up 95%. This can be done algorithmically, but only with a considerably increased personnel effort from data science experts. These 95% of the data add no value, yet they still incur costs for network and storage capacity. Deleting this data has to be administered as well. In the end, huge efforts are made to profit from data that is already present, and the profit is often out of proportion to the effort of acquiring the data.

A data lake usually means that a database is in use as well. Industrial data is, alongside machine data, normally sensor data. A SQL database, though, is only partly suitable for storing massive time series. Even a typical document-oriented NoSQL format is not fundamentally better. A dedicated time-series database is required for this, one that offers a suitable API to the rest of the lake. A single data lake technology, therefore, does not exist.
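What filtering out non-correlating data can mean in its simplest form is sketched below. The example assumes that "correlating" means a linear (Pearson) correlation with one target quantity; the signal names, the values and the 0.8 threshold are invented for illustration, and real projects need far more care (lags, nonlinear relations, spurious correlations).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant signal carries no linear information
    return cov / math.sqrt(var_x * var_y)


def select_signals(signals, target, threshold=0.8):
    """Keep the signals whose |correlation| with the target reaches the threshold."""
    return {
        name: values
        for name, values in signals.items()
        if abs(pearson(values, target)) >= threshold
    }


# Hypothetical data: one quality measure and three recorded signals.
scrap_rate = [1.0, 2.0, 3.0, 4.0, 5.0]
signals = {
    "spindle_temp": [20.0, 22.0, 24.0, 26.0, 28.0],  # rises with the target
    "door_switch": [1.0, 1.0, 1.0, 1.0, 1.0],         # constant
    "fan_speed": [3.0, 1.0, 4.0, 1.0, 5.0],           # only weakly related
}
kept = select_signals(signals, scrap_rate)
```

Here only `spindle_temp` survives the filter. Note that all three signals had to be recorded, transmitted and stored before two of them could be discarded, which is exactly the cost described above.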

Data Warehouse Strategy

With the data warehouse strategy, data is condensed, or aggregated, from warehouse to warehouse in a cascade of information. Classically, the results are business key figures. In one warehouse, for example, the individual sales are aggregated into a combined turnover; in the next, the respective figures of the divisions are aggregated; and from that warehouse, in turn, the aggregated numbers of the subsidiary are passed on to management. A data warehouse cascade is dynamic and geared to the question asked by the highest warehouse. A response is generated from numbers, data and facts by means of aggregation. If an aggregation is not possible on the basis of the data present in a lower warehouse, the question is passed down to the next lower one. In the end, new data may have to be acquired. When the same questions are asked frequently, the aggregation processes are automated. Aggregations are therefore, classically, copies. Thanks to modern IT, data can also be aggregated or streamed in near real time. Every warehouse is responsible for its own data quality and provides only meaningful aggregations. If a warehouse gathers data from several others and reduces it at the same time (via aggregation or feature extraction), it also has to ensure that the data fits together, either by recoding it or by asking the lower warehouses to acquire or provide the data in a different way. In this setup, sensor data resides in a time-series database and is transmitted via an API to, for example, a central SQL database that is specifically modeled for it.
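One step of such a cascade, from raw sensor readings to a relational table of aggregates, might look like the following sketch. Everything in it is an assumption made for illustration: the `readings` list stands in for whatever a time-series database API would return, the table and column names are invented, and an in-memory SQLite database plays the role of the central SQL store.

```python
import sqlite3
from statistics import mean

# Hypothetical raw readings as a time-series store might return them:
# (machine, unix_timestamp_seconds, value)
readings = [
    ("press_1", 0, 10.0), ("press_1", 30, 14.0), ("press_1", 60, 20.0),
    ("press_1", 90, 22.0), ("press_2", 0, 5.0), ("press_2", 45, 7.0),
]


def aggregate(rows, window_s=60):
    """Reduce raw readings to one feature row per machine and time window."""
    buckets = {}
    for machine, ts, value in rows:
        window_start = ts - ts % window_s
        buckets.setdefault((machine, window_start), []).append(value)
    return [
        (machine, start, mean(values), max(values), len(values))
        for (machine, start), values in sorted(buckets.items())
    ]


# The upper warehouse: a relational table modeled for the aggregated features.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE machine_features ("
    "machine TEXT, window_start INTEGER, mean_value REAL, "
    "max_value REAL, sample_count INTEGER, "
    "PRIMARY KEY (machine, window_start))"
)
db.executemany(
    "INSERT INTO machine_features VALUES (?, ?, ?, ?, ?)", aggregate(readings)
)
db.commit()

# A question from one level higher up is answered from the aggregates alone.
summary = db.execute(
    "SELECT machine, MAX(max_value) FROM machine_features "
    "GROUP BY machine ORDER BY machine"
).fetchall()
```

The raw readings stay in the lower warehouse; only the reduced feature rows are copied upward, and the question about peak values per machine is answered without touching the original time series.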

Well then?

Implementing a data lake and setting up a data warehouse both involve a great amount of work. Both concepts have in common that the IoT gateways have to be planned and that IT needs to provide the proper server infrastructure. The data lake concept requires specialized data scientists, administrators and programmers who can, for example, operate a Hadoop installation. Because of their background, these experts often lack the right understanding of the actual processes and are even sealed off from them. This is a poor precondition and does not encourage acceptance of the topic. A certain level of IT and algorithm expertise needs to be established in the individual divisions. If this is seen as an opportunity for a lasting digitalization strategy, the data warehouse concept is the clear favorite for handling large amounts of industrial data.
