It’s time for a change in big data storage and analytics

The Hadoop legacy

Over time, the Hadoop big data framework has run out of steam, but the core concepts it introduced (MapReduce as the processing model, keeping data close to the CPUs that process it, and using commodity hardware to build clusters for analyzing semi-structured data) have become the heart of modern data analysis systems. Hadoop has also given birth to a rich ecosystem of projects, such as Hive, Kafka, and Impala. As for its history, Hadoop was developed at Yahoo in the early 2000s. The initial idea came from two Google papers (“The Google File System” and “MapReduce: Simplified Data Processing on Large Clusters”), and it allowed Yahoo to increase its processing power while using low-cost commodity hardware.

Hadoop promised to disrupt the data warehouse and analytics market that had long been a stronghold of companies like Oracle, Teradata, and IBM. While MapReduce usage has slowly declined, the Hadoop Distributed File System (HDFS) remains prominent as the file system underpinning big data applications such as Spark and Kafka. When HDFS was born, disks were faster than network adapters, so it was built as a cluster of servers with internal disks, which made it possible to keep the data close to the compute. At the time, SSDs were very expensive and limited in capacity, mechanical disks were the norm, and most networks were still based on Gigabit Ethernet. The only way to avoid network bottlenecks was to maintain a tight coupling between storage and computation.

As big data systems evolved toward real-time solutions such as Spark and Kafka, storage had to become faster to keep up with data analytics technologies. It also needed to scale to hundreds or thousands of nodes in order to provide much larger capacity.

The evolution of big data processing

The heart of most modern big data processing is an architectural pattern known as the Lambda architecture.

This type of processing handles events generated at very large scale. The data usually comes from sources such as log files or IoT sensors and is then fed into a streaming platform like Kafka, an open-source distributed event streaming platform that can process this data quickly.

We can think of the streaming platform as a single query that runs over every value passing through it. This query acts as a conditional trigger: for example, if you are processing sensor data, you may want to catch certain values so that you can take action, whether that means launching a job in an IT system or having an operator physically check the device. The remaining values are pushed to ‘cold’ storage, enabling analysis at a larger scale over the full value history (trends, machine learning, and so on). This cold layer is usually a data warehouse, or a combination of HDFS and Spark.
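As an illustration, here is a minimal Python sketch of that hot/cold split, assuming a kafka-python consumer and a hypothetical “sensor-readings” topic, message fields, and threshold (none of which come from the article): one conditional check runs over every event, matching values trigger an action, and everything else is appended to cold storage.

```python
# Illustrative sketch only: topic, fields, threshold, and the cold-storage
# path are hypothetical; kafka-python is just one possible client.
import json
from kafka import KafkaConsumer

HOT_THRESHOLD = 90.0                      # hypothetical alert threshold
COLD_STORE_PATH = "cold/readings.jsonl"   # stand-in for HDFS or object storage

consumer = KafkaConsumer(
    "sensor-readings",                    # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

with open(COLD_STORE_PATH, "a") as cold_store:
    for message in consumer:
        reading = message.value
        if reading.get("temperature", 0.0) > HOT_THRESHOLD:
            # Hot path: the conditional action (trigger a job, alert an operator, ...).
            print(f"ALERT sensor={reading.get('sensor_id')} value={reading['temperature']}")
        else:
            # Cold path: keep the raw value for large-scale historical analysis.
            cold_store.write(json.dumps(reading) + "\n")
```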

One of the difficulties of Hadoop’s physical infrastructure with local storage was that adding storage capacity also meant adding compute capacity (again because of the tight coupling of CPU and capacity in “shared nothing” architectures), which drove up both acquisition and management costs.

As a result, the challenges posed by Hadoop led to two major outcomes. First, companies turned to object storage to make big data storage affordable. The S3 API, the most popular object storage interface, has become an industry standard, allowing companies to store, retrieve, insert, delete, and move data in virtually any object store. Second, and more recently, another technological breakthrough arrived in the form of the Disaggregated Shared Everything (DASE) architecture, which allows cluster storage to be scaled independently of CPUs to better match applications’ capacity and performance needs.
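As a rough sketch of what that standardization looks like in practice, the following Python snippet uses boto3 against a hypothetical S3-compatible endpoint and bucket (both invented for illustration) to store, retrieve, move, and delete an object; the same calls work against virtually any S3-compatible object store.

```python
# Hedged sketch: endpoint, bucket, and keys are hypothetical; boto3 is one
# common S3 client, and any S3-compatible object store accepts these calls.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.com",  # any S3-compatible endpoint
)

BUCKET = "analytics-raw"  # hypothetical bucket name

# Store an object.
s3.put_object(Bucket=BUCKET, Key="logs/2024/01/app.log", Body=b"raw log lines...")

# Retrieve it.
body = s3.get_object(Bucket=BUCKET, Key="logs/2024/01/app.log")["Body"].read()

# "Move" it: S3 has no rename, so copy to the new key, then delete the old one.
s3.copy_object(
    Bucket=BUCKET,
    Key="archive/2024/01/app.log",
    CopySource={"Bucket": BUCKET, "Key": "logs/2024/01/app.log"},
)
s3.delete_object(Bucket=BUCKET, Key="logs/2024/01/app.log")
```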

The rise of the data lake

The concept of a data lake, in which data awaiting analysis is stored in its original form, contributed to the rise of modern data warehousing. The hardest part of any data analytics or business intelligence system is moving data from its raw format into one that can be easily queried. Traditionally, ETL (extract, transform, load) has been used for this. Hadoop made it possible to replace this approach by transforming data at query time instead (extract, load, transform, or ELT). This allows different applications to use the same metadata.
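To make the ELT idea concrete, here is a hedged PySpark sketch (the path, columns, and aggregation are hypothetical, not taken from the article): the raw JSON events stay in the lake untouched, and the transformation is applied at query time.

```python
# Schema-on-read / ELT sketch with PySpark; path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# Extract and load have already happened: raw JSON events sit in the lake as-is.
raw = spark.read.json("s3a://analytics-raw/events/")   # hypothetical location

# Transform at query time: this application's own view of the same raw data.
daily_errors = (
    raw.filter(F.col("level") == "ERROR")
       .groupBy(F.to_date("timestamp").alias("day"))
       .count()
)
daily_errors.show()
```

Another application can read the very same files with a different transformation, which is what lets them share the same underlying data and metadata.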

Solutions designed to work with data lakes are fast, flexible, and support a variety of languages. Perhaps the best example is Spark, which can function as a distributed data warehouse at a much lower cost. But it is not only the analysis frameworks that have changed: the underlying storage has also undergone fundamental changes, enabling the scalability, speed, and capacity that data analysis requires.

When it was created, Hadoop was designed to work with very dense, inexpensive mechanical hard drives. Since then, flash storage has evolved, becoming cheaper, denser, and faster thanks to protocols such as NVMe (Non-Volatile Memory Express), which offers extremely low latency and enables very dense SSDs. This speed, combined with the falling cost of SSDs, makes all-flash storage a scalable, affordable foundation for data operations.

Storage has obviously evolved a lot over the years: it has become faster and far more scalable. The next big challenge is how to store, protect, and manage that data, and how to extract relevant information from it. For this purpose, the DASE (Disaggregated Shared Everything) architecture lets companies get real-time data warehouse performance, based on standard protocols such as S3, across every context: data analysis, data warehousing, and ad hoc queries for more complex data science work.
