Big data is one of the hottest trends in IT, and more companies are looking for tech workers with big data skills. Hadoop is one of the most widely used tools for storing and managing big data, so if you've worked with it, be sure to play it up on your resume. Be prepared to answer common interview questions on every aspect of Hadoop, like these about HDFS and Hadoop clusters:
1. Why do we need a tool like Hadoop?
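Big data involves volumes of data larger than can be stored or processed on a single machine. Hadoop spreads both the storage and the processing of that data across a cluster of commodity servers, so capacity grows by adding nodes.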
2. Explain the basic architecture of Hadoop.
Hadoop provides HDFS, the Hadoop Distributed File System, which is how it manages very large files. Files are split into blocks that are stored on DataNodes. A NameNode manages the file system namespace, tracks which DataNodes hold each block, and regulates client access to files in a master/slave architecture.
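As a rough illustration of that layout, the Python sketch below models how a file might be cut into fixed-size blocks and how a NameNode-style table could map each block to DataNodes. The function names, the 128 MB block size, and the round-robin placement are assumptions made for the example; the block size is configurable in HDFS, and real block placement is more involved than this.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes; configurable in HDFS


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block for a file of file_size bytes."""
    if file_size <= 0:
        return []
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes


def assign_blocks(num_blocks, datanodes, replication=3):
    """Toy placement: map each block index to distinct DataNodes, round-robin.

    This is a hypothetical stand-in for the NameNode's block map, not the
    placement policy HDFS actually uses.
    """
    copies = min(replication, len(datanodes))
    return {
        block: [datanodes[(block + i) % len(datanodes)] for i in range(copies)]
        for block in range(num_blocks)
    }
```

A 300 MB file under these assumptions becomes two full blocks plus one 44 MB block, and each block is listed against three different DataNodes.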
3. What is the heartbeat used for?
Because data is stored on many nodes, the NameNode needs to know which DataNodes are alive. The heartbeat is a signal that each DataNode sends to the NameNode at a regular interval (every few seconds by default) to confirm that it is still running and responding.
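The heartbeat check can be pictured with a small Python model. This is only a sketch: the HeartbeatMonitor class, its method names, and the 30-second timeout are invented for illustration, and real HDFS waits considerably longer (on the order of ten minutes by default) before declaring a DataNode dead.

```python
class HeartbeatMonitor:
    """Toy NameNode-side record of the last heartbeat seen from each DataNode."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds  # assumed value, chosen by the caller
        self.last_seen = {}

    def record_heartbeat(self, datanode, now):
        """Remember the time (in seconds) at which this DataNode last reported in."""
        self.last_seen[datanode] = now

    def dead_nodes(self, now):
        """Return the DataNodes whose last heartbeat is older than the timeout."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the model easy to reason about and to test.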
4. What happens if a DataNode fails to respond?
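Hadoop is designed to be resilient, so each block is replicated to multiple nodes (three by default). When a DataNode stops sending heartbeats, the NameNode marks it as dead, stops directing requests to it, and schedules new copies of its blocks on other nodes to restore the replication factor. User jobs that were accessing that node are shifted to another node that holds the same data.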
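One consequence of losing a node is that some blocks are left with fewer copies than intended, and new copies have to be made elsewhere. The Python sketch below models that bookkeeping under simplifying assumptions: the rereplicate function and its arguments are hypothetical, and new targets are simply taken in list order rather than chosen by any real HDFS policy.

```python
def rereplicate(block_map, live_nodes, dead_node, replication=3):
    """Return a new block map with dead_node removed and missing replicas refilled.

    block_map maps a block id to the list of DataNodes holding a copy.
    """
    new_map = {}
    for block, holders in block_map.items():
        # Copies that are still reachable after the failure.
        survivors = [node for node in holders if node != dead_node]
        # Live nodes that do not already hold this block.
        candidates = [
            node for node in live_nodes
            if node not in survivors and node != dead_node
        ]
        missing = max(0, replication - len(survivors))
        new_map[block] = survivors + candidates[:missing]
    return new_map
```

Blocks that never lived on the failed node come back unchanged; only the under-replicated ones gain a new holder.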
5. What other kinds of nodes are there besides NameNodes and DataNodes?
A CheckpointNode creates namespace checkpoint images. The NameNode doesn't apply updates to its on-disk namespace image in real time; it appends them to an edit log and merges the log into the image the next time it starts up. By having a CheckpointNode periodically merge the log into a new image, the NameNode has less to replay and can start more quickly. A BackupNode is similar, but it receives the edits in real time from the NameNode and so keeps an up-to-date copy of the namespace in memory.
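The image-plus-log idea can be shown with a minimal Python sketch. It assumes a drastically simplified namespace (a set of paths) and a log of ("create", path) and ("delete", path) entries; these structures and the function names are illustrative only and do not reflect the actual on-disk formats.

```python
def apply_edits(image, edits):
    """Replay edit-log entries onto a copy of a namespace image (a set of paths)."""
    namespace = set(image)
    for operation, path in edits:
        if operation == "create":
            namespace.add(path)
        elif operation == "delete":
            namespace.discard(path)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return namespace


def checkpoint(image, edits):
    """Merge the log into a new image and return it together with an empty log."""
    return apply_edits(image, edits), []
```

After a checkpoint, a restart only has to load the new image, because the log it would otherwise replay is empty.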
6. How does rack awareness help performance?
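With rack awareness, Hadoop knows which rack each node is in. Traffic between nodes in the same rack passes through a single switch and is faster than traffic that has to cross between racks, so Hadoop prefers to read from a replica in the reader's own rack and to schedule tasks close to their data. Rack awareness also guides replica placement, so that copies of a block sit on more than one rack and survive the loss of a rack.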
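To make the idea of "closeness" concrete, the Python sketch below scores locations in a simple two-level tree (racks containing hosts) and picks the nearest replica for a reader. The (rack, host) tuples, the function names, and the hop counts are assumptions chosen for the example rather than a description of Hadoop's internals.

```python
def distance(a, b):
    """Hops between two (rack, host) locations in a two-level tree topology."""
    if a == b:
        return 0  # same host
    if a[0] == b[0]:
        return 2  # same rack: up to the rack switch and back down
    return 4      # different racks: the path also crosses a core switch


def closest_replica(reader, replicas):
    """Pick the replica location with the smallest distance to the reader."""
    return min(replicas, key=lambda location: distance(reader, location))
```

Given a reader on rack1 and replicas on both rack1 and rack2, the rack1 copy wins because it avoids the trip between racks.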
7. Hadoop scales by adding nodes to a cluster; can you add a node without shutting down the cluster? How?
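Yes, nodes can be added without stopping and restarting the whole cluster. On the master, add the new node's hostname to the conf/slaves file so that the cluster start and stop scripts include it. Then, on the new slave node, run the hadoop-daemon.sh script to start the DataNode and TaskTracker processes (hadoop-daemon.sh start datanode, then hadoop-daemon.sh start tasktracker). The JobTracker runs on the master, not on the slave nodes.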
8. Once new DataNodes are added, does Hadoop automatically balance blocks across them?
Hadoop will place newly written blocks on the new nodes, but it doesn't automatically redistribute existing blocks. To optimize storage, you usually want data distributed evenly. There's a balancer script (start-balancer.sh) that you can use to rebalance the block distribution.
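A balancer of this kind needs some definition of "even." A common approach is to compare each node's disk utilization with the cluster-wide average and to treat a node as balanced when it falls within a threshold of that average. The Python sketch below illustrates that comparison; the function name, the labels, and the 10-percentage-point threshold are assumptions for the example.

```python
def classify_nodes(used, capacity, threshold_pct=10.0):
    """Label each DataNode as "over", "under", or "balanced".

    used and capacity map a node name to bytes used and total bytes.
    A node is balanced when its utilization is within threshold_pct
    percentage points of the cluster-wide utilization.
    """
    cluster_pct = 100.0 * sum(used.values()) / sum(capacity.values())
    labels = {}
    for node in used:
        node_pct = 100.0 * used[node] / capacity[node]
        if node_pct > cluster_pct + threshold_pct:
            labels[node] = "over"    # candidate source of blocks to move
        elif node_pct < cluster_pct - threshold_pct:
            labels[node] = "under"   # candidate destination for blocks
        else:
            labels[node] = "balanced"
    return labels
```

A freshly added, nearly empty node shows up as "under" in this model, which is the situation rebalancing is meant to correct.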
9. Can the balancer be run while Hadoop is in use?
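Often, admins run the balancer when the system is idle, but you can run it while the system is up. Set the dfs.balance.bandwidthPerSec parameter, which caps the number of bytes per second each DataNode may use for balancing, to keep the balancer from using too much network capacity. (Newer Hadoop releases name this property dfs.datanode.balance.bandwidthPerSec.)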
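When choosing a value for that parameter, it helps to estimate how long rebalancing will take. The short Python helper below computes a lower bound on the time one DataNode needs to move a given amount of data, assuming the limit is expressed in bytes per second and ignoring every other overhead. The function name is hypothetical.

```python
def min_transfer_seconds(bytes_to_move, bandwidth_per_sec):
    """Lower bound on the seconds one node needs to move bytes_to_move."""
    if bandwidth_per_sec <= 0:
        raise ValueError("bandwidth_per_sec must be positive")
    return bytes_to_move / bandwidth_per_sec
```

For example, moving 10 GB off a node at a limit of 1 MB per second takes at least 10,240 seconds, or just under three hours, so a very low limit can make rebalancing a large cluster slow.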
10. What's the best way to copy files between HDFS clusters?
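Use the distcp command. It runs as a MapReduce job, so the copy is divided among map tasks on multiple nodes and the workload is shared. For example: hadoop distcp hdfs://nn1:8020/source hdfs://nn2:8020/destination, where nn1 and nn2 are placeholders for the NameNodes of the two clusters.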