تقييمات الطلاب
( 5 من 5 )
١ تقييمات
فيديو شرح 5.4. HDFS Replication ضمن كورس Apache Hadoop شرح قناة CloudxLab Official، الفديو رقم 29 مجانى معتمد اونلاين
In this video, we discuss how does the replication in HDFS work.
Let’s understand the concept of blocks in HDFS. When we store a file in HDFS, the file gets split into the chunks of 128MB block size. Except for the last block all other blocks will have 128 MB in size. The last block may be less than or equal to 128MB depending on file size. This default block size is configurable.
Let’s say we want to store a 560MB file in HDFS. This file will get split into 4 blocks of 128 MB and one block of 48 MB
What are the advantages of splitting the file into blocks?
It helps fitting big file into smaller disks.
It leaves less unused space on the every datanode as many 128MB blocks can be stored on the each datanode.
It optimizes the file transfer.
Also, it distributes the load to multiple machines. Let’s say a file is stored on 10 data nodes, whenever a user accesses the file, the load gets distributed to 10 machines instead of one machine.
It is same like when we download a movie using torrent. The movie file gets broken down into multiple pieces and these pieces get downloaded from multiple machines parallelly. It helps in downloading the file faster.
Let’s understand the HDFS replication. Each block has multiple copies in HDFS. A big file gets split into multiple blocks and each block gets stored to 3 different data nodes. The default replication factor is 3. Please note that no two copies will be on the same data node. Generally, first two copies will be on the same rack and the third copy will be off the rack (A rack is an almirah where we stack the machines in the same local area network). It is advised to set replication factor to at least 3 so that even if something happens to the rack, one copy is always safe.
We can set the default replication factor of the file system as well as of each file and directory individually. For files which are not important we can decrease the replication factor and for files which are very important should have high replication factor.
Whenever a datanode goes down or fails, the namenode instructs the datanodes which have copies of lost blocks to start replicating the blocks to the other data nodes so that each file and directory again reaches the replication factor assigned to it.
This Big Data Tutorial will help you learn HDFS, ZooKeeper, Hive, HBase, NoSQL, Oozie, Flume, Sqoop, Spark, Spark RDD, Spark Streaming, Kafka, SparkR, SparkSQL, MLlib, and GraphX from scratch. Everything in this course is explained with the relevant example thus you will actually know how to implement the topics that you will learn in this course.
Let us know in the comments below if you find it helpful.
In order to claim the certificate from E&ICT Academy, IIT Roorkee, visit https://bit.ly/cxlyoutube
________
Website https://www.cloudxlab.com
Facebook https://www.facebook.com/cloudxlab
Instagram https://www.instagram.com/cloudxlab
Twitter http://www.twitter.com/cloudxlab
Join Our Discord Channel to talk to Industry Experts in real-time, to help you choose a roadmap that best suits your Tech Career, using the following link: https://discord.gg/h6qjxU94DC