28 | 5.3. HDFS Design & Limitations

02-07-2024

Video of 5.3. HDFS Design & Limitations in Apache Hadoop course by CloudxLab Official channel, video No. 28 free certified online

HDFS is designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. Let’s look at each of these design goals.
It is designed for very large files. “Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size.
It is designed for streaming data access. It is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from the source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
It is designed for commodity hardware. Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on commonly available hardware that can be obtained from multiple vendors. When hardware fails, HDFS is designed to carry on working without a noticeable interruption to the user.
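A rough calculation shows why the time to read the whole dataset matters more than the latency to the first record. The figures below (a 10 ms seek and a 100 MB/s sustained transfer rate) are illustrative assumptions, not numbers from the lesson, and the function name is hypothetical.

```python
# Illustrative assumptions, not measurements: one 10 ms seek, 100 MB/s transfer.
SEEK_SECONDS = 0.010
TRANSFER_BYTES_PER_SEC = 100 * 10**6


def read_time_seconds(size_bytes):
    """Time to read size_bytes sequentially after a single seek."""
    return SEEK_SECONDS + size_bytes / TRANSFER_BYTES_PER_SEC


one_gb = 10**9
total = read_time_seconds(one_gb)
seek_share = SEEK_SECONDS / total  # fraction of the total spent waiting for the first byte

print(f"total read time: {total:.2f} s")
print(f"share spent on the initial seek: {seek_share:.4%}")
```

Under these assumptions the initial latency is a tiny fraction of the total, which is why a system built for full scans optimizes for throughput.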
It is also worth knowing the applications for which HDFS does not work so well.
HDFS does not work well for low-latency data access. Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. HDFS is optimized for delivering high throughput, and this may come at the expense of latency.
HDFS is not a good fit if we have a lot of small files. Because the namenode holds filesystem metadata in memory, the number of files a filesystem can hold is limited by the amount of memory on the namenode.
If we have multiple writers or need arbitrary file modifications, HDFS is not a good fit. A file in HDFS can be written by only a single writer at any time.
Writes are always made at the end of the file, in append-only fashion.
There is no support for modifications at arbitrary offsets in the file.
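The single-writer, append-only rules above can be pictured with a small toy model. This is a sketch for illustration only: the class `AppendOnlyFile` and its methods are hypothetical and are not the HDFS client API.

```python
class AppendOnlyFile:
    """Toy model of the write rules described above (not the HDFS API)."""

    def __init__(self):
        self._data = bytearray()
        self._writer = None  # client currently allowed to write, if any

    def open_for_append(self, client):
        # Only one writer may hold the file open for writing at a time.
        if self._writer is not None:
            raise RuntimeError(f"file is already being written by {self._writer}")
        self._writer = client

    def append(self, client, payload):
        # New bytes always go at the end of the file.
        if self._writer != client:
            raise PermissionError(f"{client} is not the current writer")
        self._data.extend(payload)

    def write_at(self, client, offset, payload):
        # Modifying bytes at an arbitrary offset is not supported.
        raise NotImplementedError("writes at arbitrary offsets are not supported")

    def close(self, client):
        if self._writer == client:
            self._writer = None

    def read(self):
        # Any number of readers may read the data.
        return bytes(self._data)


f = AppendOnlyFile()
f.open_for_append("client-a")
f.append("client-a", b"hello ")
f.append("client-a", b"hdfs")

try:
    f.open_for_append("client-b")  # second writer is rejected
except RuntimeError as err:
    print("rejected:", err)

f.close("client-a")
print(f.read())
```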
An HDFS cluster has two types of nodes: one namenode, also known as the master, and multiple datanodes.
An HDFS cluster consists of many machines. One of these machines is designated as the namenode, and the other machines act as datanodes. Please note that a datanode can also run on the machine where the namenode service is running. By default, the namenode metadata service listens on port 8020.
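Clients usually refer to the namenode with an `hdfs://host:port` style URI. The sketch below, using only Python's standard library, shows how the default port fits in. The helper name and the host `namenode.example.com` are hypothetical, and the 8020 default is taken from the lesson.

```python
from urllib.parse import urlsplit

DEFAULT_NAMENODE_PORT = 8020  # default stated in the lesson


def namenode_address(uri):
    """Return (host, port) for an hdfs:// URI, using the default port if none is given."""
    parts = urlsplit(uri)
    if parts.scheme != "hdfs":
        raise ValueError(f"not an hdfs URI: {uri}")
    return parts.hostname, parts.port or DEFAULT_NAMENODE_PORT


# Hypothetical host names, for illustration only.
print(namenode_address("hdfs://namenode.example.com/user/data"))
print(namenode_address("hdfs://namenode.example.com:9000/user/data"))
```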
The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored in RAM and persisted on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located.
Datanodes are the workhorses of the filesystem. While the namenode keeps the index of which block is stored on which datanode, the datanodes store the actual data. In short, datanodes do not know the name of a file, and the namenode does not know what is inside a file.
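The division of labor between the namenode and the datanodes can be sketched as a toy in-memory model. All class and function names here are hypothetical. The model ignores replication and uses a 4-byte block size purely so the example stays small.

```python
class DataNode:
    """Stores block contents keyed by block id; it never sees file names."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes


class NameNode:
    """Stores only metadata: which blocks make up a file and where they live."""

    def __init__(self):
        self.file_blocks = {}      # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> name of the datanode holding it


def write_file(namenode, datanodes, path, data, block_size):
    block_ids = []
    for start in range(0, len(data), block_size):
        block_id = f"blk_{len(namenode.block_locations)}"
        node = datanodes[len(block_ids) % len(datanodes)]  # simple round-robin placement
        node.blocks[block_id] = data[start:start + block_size]
        namenode.block_locations[block_id] = node.name
        block_ids.append(block_id)
    namenode.file_blocks[path] = block_ids


def read_file(namenode, datanodes, path):
    by_name = {node.name: node for node in datanodes}
    # Ask the namenode for block ids and locations, then fetch bytes from the datanodes.
    return b"".join(
        by_name[namenode.block_locations[block_id]].blocks[block_id]
        for block_id in namenode.file_blocks[path]
    )


nn = NameNode()
dns = [DataNode("dn1"), DataNode("dn2")]
write_file(nn, dns, "/logs/app.log", b"hello hdfs!", block_size=4)

print(nn.file_blocks)   # the namenode knows names and block ids, not contents
print(dns[0].blocks)    # a datanode knows block contents, not file names
print(read_file(nn, dns, "/logs/app.log"))
```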
As a rule of thumb, the namenode should be installed on a machine with plenty of RAM, because all the metadata for files and directories is held in RAM. If the namenode does not have enough RAM, we will not be able to store many files in HDFS, as the metadata will not fit in memory. Since datanodes store the actual data, they should run on machines with large disks.
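A back-of-the-envelope estimate makes the RAM requirement concrete. The sketch below assumes a commonly quoted rule of thumb of roughly 150 bytes of namenode memory per file, directory, or block. That figure is an assumption that does not come from the lesson, and real usage varies by Hadoop version and configuration.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb per file, directory, or block


def namenode_memory_bytes(num_files, blocks_per_file=1, bytes_per_object=BYTES_PER_OBJECT):
    """Rough namenode memory needed: one object per file plus one per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


for num_files in (10**6, 10**8, 10**9):
    gigabytes = namenode_memory_bytes(num_files) / 10**9
    print(f"{num_files:>13,} single-block files -> about {gigabytes:.1f} GB of namenode RAM")
```

Under this assumption, the same amount of data split into many small files costs far more namenode memory than a few large files, which is the reason small files are a poor fit.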
This Big Data Tutorial will help you learn HDFS, ZooKeeper, Hive, HBase, NoSQL, Oozie, Flume, Sqoop, Spark, Spark RDD, Spark Streaming, Kafka, SparkR, SparkSQL, MLlib, and GraphX from scratch. Every topic in this course is explained with a relevant example, so you will know how to implement what you learn.
Let us know in the comments below if you find it helpful.
In order to claim the certificate from E&ICT Academy, IIT Roorkee, visit https://bit.ly/cxlyoutube
________
Website https://www.cloudxlab.com
Facebook https://www.facebook.com/cloudxlab
Instagram https://www.instagram.com/cloudxlab
Twitter http://www.twitter.com/cloudxlab
