Video 5.1, "HDFS: Why HDFS?", from the Apache Hadoop course by the CloudxLab Official channel (video No. 26).
In this video, we will learn about the Hadoop Distributed File System (HDFS), which is one of the main components of the Hadoop ecosystem.
Before going into the depths of HDFS, let us discuss a problem statement.
If we have 100 TB of data, how will we design a system to store it? Let’s take 2 minutes to find possible solutions, and then we will discuss them.
One possible solution is to build network-attached storage (NAS) or a storage area network (SAN). We can buy a hundred 1 TB hard disks and mount them to a hundred subfolders, as shown in the image. What will be the challenges in this approach? Let us take 2 minutes to find the challenges, and then we will discuss them.
Let us discuss the challenges.
How will we handle failover and backups?
Failover means switching to a redundant or standby hard disk when any hard disk fails. For backup, we can add extra hard disks or build a RAID (redundant array of independent disks) for every hard disk in the system, but this still does not solve the problem of failover, which is really important for real-time applications.
How will we distribute the data uniformly?
Distributing the data uniformly across the hard disks is really important so that no single disk will be overloaded at any point in time.
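To see why placement matters, here is a minimal sketch that compares two ways of putting files on disks. The function names, the four 1000 GB disks, and the twenty 100 GB files are assumptions made up for illustration; they do not come from the video and this is not how NAS, SAN, or HDFS actually place data.

```python
def place_sequential(file_sizes_gb, num_disks, disk_capacity_gb):
    """Fill disk 0 until it is full, then disk 1, and so on."""
    usage = [0] * num_disks
    disk = 0
    for size in file_sizes_gb:
        # Skip ahead to the first disk that still has room for this file.
        while disk < num_disks and usage[disk] + size > disk_capacity_gb:
            disk += 1
        if disk == num_disks:
            raise RuntimeError("out of space")
        usage[disk] += size
    return usage


def place_least_used(file_sizes_gb, num_disks, disk_capacity_gb):
    """Always put the next file on the disk with the least data so far."""
    usage = [0] * num_disks
    for size in file_sizes_gb:
        target = min(range(num_disks), key=lambda d: usage[d])
        if usage[target] + size > disk_capacity_gb:
            raise RuntimeError("out of space")
        usage[target] += size
    return usage


files = [100] * 20  # hypothetical workload: twenty 100 GB files
sequential_usage = place_sequential(files, 4, 1000)
balanced_usage = place_least_used(files, 4, 1000)
print(sequential_usage)
print(balanced_usage)
```

With sequential filling, the first two disks end up completely full while the other two stay empty. With least-used placement, every disk holds the same amount of data.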
Is it the best use of available resources?
We may have other, smaller hard disks available, but we may not be able to add them to the NAS or SAN because huge files cannot be stored on these smaller hard disks. Therefore, we will need to buy new, bigger hard disks.
How will we handle frequent access to files?
What if most of the users want to access the files stored on one of the hard disks? File access will be really slow in that case, and eventually no user may be able to access the file because of congestion.
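The point about small disks can be checked with a little arithmetic. The sketch below assumes eight spare 250 GB disks and one 600 GB file; these numbers and the function name are hypothetical. It also assumes that a file must live entirely on one disk, as in the subfolder approach described above.

```python
def fits_on_single_disk(file_size_gb, free_space_gb):
    """Return True if at least one disk has room for the whole file."""
    return any(free >= file_size_gb for free in free_space_gb)


small_disks_free_gb = [250] * 8  # hypothetical: eight spare 250 GB disks
huge_file_gb = 600               # hypothetical: one 600 GB file

total_free_gb = sum(small_disks_free_gb)
can_store = fits_on_single_disk(huge_file_gb, small_disks_free_gb)

print(f"total free space: {total_free_gb} GB")
print(f"file fits on a single disk: {can_store}")
```

Together the small disks have 2000 GB free, which is more than three times the file size, yet no single disk can hold the file.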
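A rough back-of-the-envelope sketch shows how quickly a popular disk becomes a bottleneck. It assumes a disk that can deliver 150 MB/s in total and shares it equally among all readers. Both the throughput figure and the equal-sharing model are simplifying assumptions, and real disks usually do worse under many concurrent readers.

```python
def per_user_throughput_mb_s(disk_throughput_mb_s, concurrent_readers):
    """Idealized model: the disk's throughput is split equally among readers."""
    return disk_throughput_mb_s / concurrent_readers


def seconds_to_read(file_size_mb, disk_throughput_mb_s, concurrent_readers):
    """Time for one reader to finish reading the file under equal sharing."""
    share = per_user_throughput_mb_s(disk_throughput_mb_s, concurrent_readers)
    return file_size_mb / share


DISK_MB_S = 150  # assumed total throughput of one hard disk
FILE_MB = 1000   # assumed size of the popular file (about 1 GB)

for readers in (1, 10, 100):
    t = seconds_to_read(FILE_MB, DISK_MB_S, readers)
    print(f"{readers:>3} readers: {t:.1f} s per reader")
```

Under these assumptions a single reader finishes in under 7 seconds, while with 100 readers on the same disk each one waits more than 11 minutes.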
How will we scale out?
Scaling out means adding new hard disks when we need more storage. When we add more hard disks, the data will not be uniformly distributed, because the old hard disks will have more data and the newly added hard disks will have little or no data.
To solve the above problems, Hadoop comes with a distributed filesystem called HDFS. We may sometimes see it referred to as “DFS” informally or in older documentation or configurations.
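The imbalance after scaling out can be put into numbers. The sketch below assumes four old disks holding 900 GB each and two new empty disks; the figures and the function name are hypothetical. It computes how much data we would have to move by hand to make every disk hold the same amount.

```python
def data_to_move_gb(usage_gb):
    """Data that must leave the over-full disks so all disks hold the average."""
    target = sum(usage_gb) / len(usage_gb)
    return sum(used - target for used in usage_gb if used > target)


old_disks_gb = [900, 900, 900, 900]  # hypothetical: four nearly full disks
new_disks_gb = [0, 0]                # hypothetical: two newly added disks

usage_after_scaling = old_disks_gb + new_disks_gb
target_per_disk = sum(usage_after_scaling) / len(usage_after_scaling)
to_move = data_to_move_gb(usage_after_scaling)

print(f"usage right after scaling out: {usage_after_scaling}")
print(f"even share per disk: {target_per_disk} GB")
print(f"data to move to rebalance: {to_move} GB")
```

In this example each disk should hold 600 GB, so 300 GB has to move off each old disk, 1200 GB in total, before the data is uniform again.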
This Big Data Tutorial will help you learn HDFS, ZooKeeper, Hive, HBase, NoSQL, Oozie, Flume, Sqoop, Spark, Spark RDD, Spark Streaming, Kafka, SparkR, SparkSQL, MLlib, and GraphX from scratch. Everything in this course is explained with a relevant example, so you will know how to implement the topics that you learn in this course.