With big data being one of the current hot IT trends, more companies are searching for tech workers with big data skills. Hadoop is one of the most widely used tools for storing and managing big data, so if you've worked with it, be sure to play it up on your resume. Be prepared to answer common Hadoop interview questions on every aspect of Hadoop, like these about HDFS and Hadoop clusters:
1. Why do we need a tool like Hadoop?
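Big data involves volumes of data too large to store or process on a single machine's file system. Hadoop spreads both the storage and the computation across a cluster of inexpensive machines, so capacity grows by adding nodes.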
2. Explain the basic architecture of Hadoop.
Hadoop provides the Hadoop Distributed File System (HDFS), which is how it manages very large files. Files are split into blocks that are stored on DataNodes. A NameNode manages the namespace and controls access to the DataNodes in a master/slave architecture.
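To make the block idea concrete, here is a minimal sketch in Python of how a file divides into blocks and how replicas might be spread over DataNodes. It is an illustration, not HDFS code: the 128 MB block size and replication factor of 3 are assumed defaults (older releases used 64 MB blocks), the function names are hypothetical, and the round-robin assignment is a simplification of the real placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (128 MB)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block a file of file_size bytes occupies."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes


def assign_blocks(num_blocks, datanodes, replication=3):
    """Map each block index to `replication` distinct DataNodes (round-robin)."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# A 300 MB file needs two full 128 MB blocks plus one 44 MB block.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))  # -> 3
print(assign_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

In this model the NameNode's role corresponds to the dictionary returned by assign_blocks: it records where blocks live, while the DataNodes hold the bytes.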
3. What is the heartbeat used for?
Because data is stored on many nodes, the NameNode needs to know which of them are reachable. The heartbeat is a periodic signal that each DataNode sends to the NameNode to confirm that the DataNode is alive and responding.
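The heartbeat check can be sketched as a table of last-seen timestamps. The class below is a hypothetical model, not the NameNode's implementation; the 630-second timeout assumes the commonly cited defaults of a 3-second heartbeat interval and roughly ten and a half minutes before a node is declared dead.

```python
class HeartbeatMonitor:
    """Toy model of the NameNode tracking DataNode heartbeats."""

    def __init__(self, timeout_seconds=630.0):
        # Assumed default: a node is considered dead after ~10.5 minutes of silence.
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}

    def record_heartbeat(self, datanode, now):
        """Remember the time (in seconds) at which a DataNode last reported in."""
        self.last_seen[datanode] = now

    def dead_nodes(self, now):
        """Return the DataNodes whose last heartbeat is older than the timeout."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor()
monitor.record_heartbeat("dn1", now=0)
monitor.record_heartbeat("dn2", now=600)
print(monitor.dead_nodes(now=700))  # -> ['dn1']
```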
4. What happens if a DataNode fails to respond?
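Hadoop is designed to be resilient, so each block is replicated to multiple nodes. When the NameNode stops receiving heartbeats from a DataNode, it marks that node as dead, directs user jobs that were accessing it to other nodes holding replicas of the data, and schedules new copies of the affected blocks so the replication factor is restored.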
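As a rough illustration of recovery after a node failure, the sketch below finds blocks that have fewer live replicas than they should and picks nodes to receive new copies. The function name and the choice of targets in alphabetical order are assumptions made for the example; the real NameNode also weighs rack placement and node load.

```python
def plan_rereplication(block_locations, live_nodes, replication=3):
    """Return {block: [target nodes]} for blocks with too few live replicas."""
    plan = {}
    for block, holders in sorted(block_locations.items()):
        surviving = [node for node in holders if node in live_nodes]
        missing = replication - len(surviving)
        if missing <= 0 or not surviving:
            # Fully replicated already, or no live copy left to read from.
            continue
        candidates = [node for node in sorted(live_nodes) if node not in surviving]
        plan[block] = candidates[:missing]
    return plan


locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}
# dn1 has failed, so only blk_1 is short one replica.
print(plan_rereplication(locations, live_nodes={"dn2", "dn3", "dn4", "dn5"}))
# -> {'blk_1': ['dn4']}
```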
5. What other kinds of nodes are there besides NameNodes and DataNodes?
A CheckpointNode creates namespace checkpoint images. The NameNode keeps the namespace in memory and records each change in an edit log rather than rewriting its on-disk image; the log is merged into the image the next time the NameNode starts up. By having a CheckpointNode periodically merge the log into a fresh image, the NameNode has less to replay and can start more quickly. A BackupNode is similar, but receives the edits in real time from the NameNode and keeps an up-to-date copy of the namespace.
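The checkpoint step amounts to replaying a log of edits onto a saved image. The sketch below models the namespace as a plain dictionary and the edit log as a list of tuples; both representations and the three operation names are assumptions for illustration, not the actual on-disk formats.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to the namespace dictionary in place."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = edit[2]  # edit[2] is the file's metadata
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "rename":
        namespace[edit[2]] = namespace.pop(path)  # edit[2] is the new path
    else:
        raise ValueError("unknown operation: %r" % (op,))


def checkpoint(fsimage, edit_log):
    """Merge the edit log into a copy of the image; return (new image, empty log)."""
    new_image = dict(fsimage)
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


image = {"/a": 1}
edits = [("create", "/b", 2), ("rename", "/a", "/c"), ("delete", "/b")]
new_image, remaining = checkpoint(image, edits)
print(new_image)  # -> {'/c': 1}
```

After a checkpoint, a restart only has to load the new image and replay whatever edits arrived since, which is why periodic checkpoints shorten startup.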
6. How does rack awareness help performance?
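Rack awareness tells the NameNode which rack each DataNode sits in. Traffic between nodes in the same rack is generally faster than traffic that has to cross between racks, so Hadoop prefers nearby replicas when reading data and scheduling work. It also places replicas on more than one rack, so losing a whole rack doesn't make the data unavailable.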
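A simplified sketch of rack-aware replica placement follows. It mirrors the commonly documented default for three replicas (first on the writer's node, second on a node in a different rack, third on another node in that same remote rack), but it picks nodes in alphabetical order where HDFS would choose randomly, and the function name and topology dictionary are hypothetical.

```python
def place_replicas(writer, topology):
    """Choose three nodes for a block; topology maps node name -> rack name."""
    local_rack = topology[writer]
    remote_nodes = sorted(n for n, rack in topology.items() if rack != local_rack)
    if not remote_nodes:
        raise ValueError("need at least two racks")
    second = remote_nodes[0]
    # The third replica shares the second replica's rack, limiting cross-rack traffic.
    third_candidates = [
        n for n in remote_nodes if topology[n] == topology[second] and n != second
    ]
    if not third_candidates:
        raise ValueError("remote rack needs at least two nodes")
    return [writer, second, third_candidates[0]]


topology = {"dn1": "rack1", "dn2": "rack1", "dn3": "rack2", "dn4": "rack2"}
print(place_replicas("dn1", topology))  # -> ['dn1', 'dn3', 'dn4']
```

Only one of the three copies crosses between racks during the write, yet the block survives the loss of either rack.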
7. Hadoop scales by adding nodes to a cluster; can you add a node without shutting down the cluster? How?
Yes, nodes can be added without stopping and restarting the whole cluster. On the master, add the new node's hostname to the conf/slaves file. Then on the new slave node, run the hadoop-daemon.sh script to start the datanode and tasktracker processes (hadoop-daemon.sh start datanode, then hadoop-daemon.sh start tasktracker). The JobTracker runs on the master, not on the slave nodes.
8. Once new DataNodes are added, does Hadoop automatically balance blocks across them?
Hadoop will place newly written blocks on the new nodes, but it doesn't automatically redistribute existing blocks. To optimize storage, you usually want data distributed evenly. There's a balancer script you can run to rebalance the block distribution.
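The balancer's notion of "evenly distributed" can be sketched as a comparison of each node's utilization with the cluster average. The function below is an illustrative model, not the balancer's code; the 10 percent threshold is the commonly cited default, and the names are hypothetical.

```python
def classify_nodes(used, capacity, threshold=10.0):
    """Return (average utilization %, over-utilized nodes, under-utilized nodes)."""
    average = 100.0 * sum(used.values()) / sum(capacity.values())
    over, under = [], []
    for node in sorted(used):
        utilization = 100.0 * used[node] / capacity[node]
        if utilization > average + threshold:
            over.append(node)   # candidate source of blocks to move
        elif utilization < average - threshold:
            under.append(node)  # candidate destination for moved blocks
    return average, over, under


# dn3 is a freshly added, empty node; sizes are in arbitrary units.
used = {"dn1": 80, "dn2": 70, "dn3": 0}
capacity = {"dn1": 100, "dn2": 100, "dn3": 100}
print(classify_nodes(used, capacity))  # -> (50.0, ['dn1', 'dn2'], ['dn3'])
```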
9. Can the balancer be run while Hadoop is in use?
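Often, admins run the balancer when the system is lightly used, but you can also run it while the system is up. Set the dfs.balance.bandwidthPerSec parameter (named dfs.datanode.balance.bandwidthPerSec in later Hadoop releases) to cap the bandwidth each DataNode spends on rebalancing, so the balancer doesn't use too much capacity.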
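A quick back-of-the-envelope calculation shows why the bandwidth cap matters. The sketch assumes a limit of 1 MB per second (1,048,576 bytes, a commonly cited default for older releases) and treats the limit as applying to each participating DataNode; the function name and the concurrent_nodes parameter are hypothetical simplifications.

```python
def balancing_hours(bytes_to_move, bandwidth_per_sec, concurrent_nodes=1):
    """Estimate the hours needed to move data under a per-node bandwidth cap."""
    if bandwidth_per_sec <= 0 or concurrent_nodes <= 0:
        raise ValueError("bandwidth and node count must be positive")
    seconds = bytes_to_move / (bandwidth_per_sec * concurrent_nodes)
    return seconds / 3600


GB = 1024 ** 3
MB = 1024 ** 2
# Moving 100 GB at 1 MB/s through a single node takes 102,400 seconds.
print(round(balancing_hours(100 * GB, 1 * MB), 1))  # -> 28.4
```

A low cap keeps user jobs responsive at the cost of a long rebalance; raising it shortens the rebalance but competes with regular traffic.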
10. What's the best way to copy files between HDFS clusters?
Use the distcp command. It runs the copy as a MapReduce job, so the workload is shared across multiple nodes instead of passing through a single machine.
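The idea of sharing a copy job across workers can be sketched as follows. This is not distcp's actual splitting logic; it is a simple greedy scheme, assumed here for illustration, that hands each file to whichever worker currently has the fewest bytes to copy. The function name and inputs are hypothetical.

```python
import heapq


def share_copy_work(file_sizes, num_workers):
    """Assign files to workers so that byte counts stay roughly balanced."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    heap = [(0, worker) for worker in range(num_workers)]  # (bytes assigned, worker id)
    assignments = {worker: [] for worker in range(num_workers)}
    # Largest files first; ties are broken by path so the result is deterministic.
    for path, size in sorted(file_sizes.items(), key=lambda item: (-item[1], item[0])):
        load, worker = heapq.heappop(heap)
        assignments[worker].append(path)
        heapq.heappush(heap, (load + size, worker))
    return assignments


files = {"/a": 50, "/b": 30, "/c": 20, "/d": 10}
print(share_copy_work(files, num_workers=2))
# -> {0: ['/a', '/d'], 1: ['/b', '/c']}
```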
Big Data is big. The technology is now being used across all industries, from manufacturing to healthcare to even relatively low-tech retail and hospitality firms. Hadoop, the main technology behind Big Data, is a framework that lets calculations on massive data sets run on clusters of inexpensive hardware, often in the cloud.
Big Corporate Commitments to Big Data
According to the research firm Gartner, more than 40 percent of the companies it surveyed will invest in Hadoop development over the next two years. In the manufacturing industry, another survey showed big data was a priority for more than 80 percent of firms.
For many companies, one of the biggest stumbling blocks is a lack of familiarity with the technology and a lack of staff with the necessary experience. Because of this, developers with Hadoop skills are able to pull down big salaries – the average annual salary for Hadoop developers is more than $115,000.
Multiple Options for Companies Using Hadoop
Companies that sell big data products are trying to reduce the skills threshold in several ways. All vendors offer training, of course. Cloud providers including Amazon and Google offer Hadoop as a Service, letting businesses more easily spin up a Hadoop environment. These on-demand environments let companies dive right into the analysis that matters to them, rather than focusing on details like provisioning nodes and tuning clusters. Companies like Oracle provide pre-packaged analytics for specific industries.
Multiple Options for Developers with Hadoop Skills
All of that means developers with Hadoop skills have plenty of options: working for a company implementing its own big data projects, for a cloud vendor providing big data environments on demand, or for a packaged software vendor creating standard analytics reports.
Get Training in Big Data Skills
Developers who want to work with Big Data should get training in Hadoop, but that's not the only skill they need. Big Data depends on databases, and NoSQL is the chief database technology used. Although many Big Data developers will work with vendor analytics products, understanding data mining and statistical analysis is still necessary. Big Data developers should also have real familiarity with at least one of the vendor Hadoop as a Service offerings.
Currently, most big data opportunities are in geographic areas with large clusters of technology firms, like Silicon Valley, New York, and Seattle. As big data usage continues to spread, so will the need for its skills, meaning the opportunities will spread across the country.