5.5. HDFS File Reading — from the Apache Hadoop course by CloudxLab Official (video no. 30)
In this video, we discuss how files are read from and written to HDFS.
When a user wants to read a file, the client talks to the namenode, and the namenode returns the file's metadata. The metadata contains information about the file's blocks and their locations.
Once the client receives the metadata, it communicates with the datanodes directly and reads the data either sequentially or in parallel. This keeps the namenode from becoming a bottleneck, because the client contacts it only once, to fetch the metadata.
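Conceptually, the metadata the namenode returns maps each block of the file to the datanodes holding its replicas. The sketch below is purely illustrative: the dictionary layout and names like `block_id` and `locations` are hypothetical, not the actual HDFS wire format (which uses protobuf messages).

```python
# Hypothetical sketch of the metadata a namenode might return for one file.
# Structure and field names are illustrative, not the real HDFS protocol.
file_metadata = {
    "path": "/data/logs/2024-01-01.log",
    "block_size": 128 * 1024 * 1024,  # HDFS default block size: 128 MB
    "blocks": [
        {"block_id": "blk_001", "locations": ["datanode1", "datanode3", "datanode4"]},
        {"block_id": "blk_002", "locations": ["datanode2", "datanode3", "datanode5"]},
    ],
}

# The client contacts the namenode once for this mapping, then reads
# each block directly from one of its datanode locations.
for block in file_metadata["blocks"]:
    print(block["block_id"], "->", block["locations"][0])
```

Note that the actual file data never flows through the namenode; it only hands out this block-to-location mapping.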
By design, HDFS ensures that no two writers can write to the same file at the same time, because a single namenode coordinates all writes.
If there were multiple independent namenodes and clients sent requests to different ones, the entire filesystem could become corrupted.
This is because those multiple requests could write to the same file at the same time.
Let’s understand how files are written to HDFS.
When a user uploads a file to HDFS, the client, on behalf of the user, tells the namenode that it wants to create the file.
The namenode replies with the locations of the datanodes where the file can be written.
The namenode also creates a temporary entry for the file in its metadata.
The client then opens the output stream and writes the file to the first datanode.
The first datanode is the one closest to the client machine.
If the client is running on a machine that is also a datanode, the first copy is written to that machine.
Once the file is stored on the first datanode, the data is copied to the other datanodes.
When the first copy is completely written, the datanode informs the client that the file has been created.
The client then confirms to the namenode that the file has been created.
The namenode cross-checks this with the datanodes and finalizes the entry in its metadata.
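The write flow above can be sketched as a small simulation. This is a minimal illustration of the sequence of steps, not the real HDFS client: class and method names (`NameNode`, `create_file`, `complete_file`, etc.) are hypothetical, and a real client would use RPC calls and a replication pipeline between datanodes.

```python
# Illustrative simulation of the HDFS write flow: temporary metadata entry,
# write to the first (closest) datanode, replication, then finalization.
# All names here are hypothetical stand-ins for the real RPC protocol.

class NameNode:
    def __init__(self):
        self.metadata = {}  # path -> {"state": ..., "datanodes": [...]}

    def create_file(self, path, datanode_names):
        # Temporary entry: the file is not final until the client confirms.
        self.metadata[path] = {"state": "in_progress", "datanodes": datanode_names}
        return datanode_names  # locations where the file can be written

    def complete_file(self, path):
        # After cross-checking with datanodes, finalize the metadata entry.
        self.metadata[path]["state"] = "complete"

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, path, data):
        self.blocks[path] = data

def write_file(namenode, datanodes, path, data):
    namenode.create_file(path, [d.name for d in datanodes])
    datanodes[0].write(path, data)      # first copy goes to the closest datanode
    for replica in datanodes[1:]:       # data is then copied to the other replicas
        replica.write(path, data)
    namenode.complete_file(path)        # client confirms; namenode finalizes entry

nn = NameNode()
dns = [DataNode(f"datanode{i}") for i in range(1, 4)]
write_file(nn, dns, "/user/demo/file.txt", b"hello hdfs")
print(nn.metadata["/user/demo/file.txt"]["state"])  # -> complete
```

The "temporary entry" step matters: if the client crashes mid-write, the namenode never marks the file complete, so readers never see a half-written file.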
Now, let's try to understand what happens when reading a file from HDFS.
When a user wants to read a file, the HDFS client, on behalf of the user, talks to the namenode.
Instead of returning the actual data, the namenode provides the locations of the file's blocks and their replicas.
From these locations, the client chooses the datanodes closest to it.
The client then talks to those datanodes directly and reads the data from the blocks.
The client can read the blocks of the file either sequentially or in parallel.
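The read path described above can be sketched in a few lines: one metadata lookup, then direct block reads that can run in parallel. This is a hedged simulation, not a real HDFS client; the block IDs, the in-memory "datanode storage", and the replica-selection rule (just take the first location) are all illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch of the HDFS read path: the client gets block
# locations once, then reads blocks directly from datanodes in parallel.
# All names and data structures here are hypothetical.

block_locations = {  # what the namenode might return for a 3-block file
    "blk_1": ["datanode1", "datanode2"],
    "blk_2": ["datanode2", "datanode3"],
    "blk_3": ["datanode1", "datanode3"],
}

datanode_storage = {  # stand-in for block contents stored on the datanodes
    "blk_1": b"part-1 ",
    "blk_2": b"part-2 ",
    "blk_3": b"part-3",
}

def read_block(block_id):
    # A real client would pick the closest replica; here we take the first.
    chosen_datanode = block_locations[block_id][0]
    return datanode_storage[block_id]  # read directly from that datanode

# Read all blocks simultaneously, then reassemble them in block order.
with ThreadPoolExecutor() as pool:
    parts = list(pool.map(read_block, ["blk_1", "blk_2", "blk_3"]))
file_contents = b"".join(parts)
print(file_contents)  # -> b'part-1 part-2 part-3'
```

`ThreadPoolExecutor.map` preserves input order, so the blocks reassemble correctly even though the reads happen concurrently; a sequential read would simply loop over the block IDs instead.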
This Big Data tutorial will help you learn HDFS, ZooKeeper, Hive, HBase, NoSQL, Oozie, Flume, Sqoop, Spark, Spark RDD, Spark Streaming, Kafka, SparkR, SparkSQL, MLlib, and GraphX from scratch. Every topic in this course is explained with a relevant example, so you will actually know how to implement what you learn.
Let us know in the comments below if you find it helpful.
In order to claim the certificate from E&ICT Academy, IIT Roorkee, visit https://bit.ly/cxlyoutube
________
Website https://www.cloudxlab.com
Facebook https://www.facebook.com/cloudxlab
Instagram https://www.instagram.com/cloudxlab
Twitter http://www.twitter.com/cloudxlab