Difference Between HDFS and HBase
The volume of data is increasing every day, and organizations need to store and process it. HBase and HDFS are two important components of the Hadoop ecosystem that help store and process huge datasets. The data might be structured, semi-structured, or unstructured; HDFS and HBase can handle all three. HDFS stands for Hadoop Distributed File System. It manages the storage of data across a network of machines, while the processing of the datasets is done with MapReduce. HDFS is suitable for storing large files with a streaming access pattern, i.e. data is written to files once and read as many times as required. HBase is the NoSQL database that runs on top of HDFS. It stores data in a column-oriented form and is known as the Hadoop database. HBase provides consistent reads and writes in real time, along with horizontal scalability.
HDFS (Hadoop Distributed File System) runs on commodity hardware and lets you store huge amounts of data in a distributed and redundant manner. HBase (Hadoop’s database) is a NoSQL database that runs on top of your Hadoop cluster.
Let us take a look at the components and architecture of HDFS and HBase respectively:
Components of HDFS
- NameNode
- DataNode
NameNode: The NameNode can be considered the master of the system. It maintains the file system tree and the metadata for all the files and directories present in the system. Two files, the ‘namespace image’ and the ‘edit log’, are used to store the metadata. The NameNode knows all the DataNodes that hold data blocks for a given file; however, it does not store block locations persistently. This information is reconstructed from the DataNodes every time the system starts.
DataNode: DataNodes are the worker (slave) nodes that reside on each machine in a cluster and provide the actual storage. They are responsible for serving read and write requests from clients.
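The split between persisted metadata and rebuilt block locations can be illustrated with a small toy model. This is only a sketch: the class `ToyNameNode`, its methods, and the block and DataNode names are hypothetical and are not part of any Hadoop API.

```python
class ToyNameNode:
    """Toy model: the namespace is persisted, block locations are not."""

    def __init__(self, namespace):
        # Stand-in for the namespace image + edit log: file path -> block ids.
        self.namespace = dict(namespace)
        # Kept in memory only; empty after every (re)start.
        self.block_locations = {}

    def receive_block_report(self, datanode, block_ids):
        # Each DataNode reports which blocks it holds.
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        # For every block of the file, list the DataNodes known to hold it.
        return [
            (block_id, sorted(self.block_locations.get(block_id, ())))
            for block_id in self.namespace[path]
        ]


namenode = ToyNameNode({"/foodir/myfile.txt": ["blk_1", "blk_2"]})
before_reports = namenode.locate("/foodir/myfile.txt")

namenode.receive_block_report("datanode1", ["blk_1", "blk_2"])
namenode.receive_block_report("datanode2", ["blk_2"])
after_reports = namenode.locate("/foodir/myfile.txt")
```

Right after start-up the toy NameNode knows which blocks make up the file but not where they are; the locations only appear once the DataNodes have reported in.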
HDFS Architecture:-
Components of HBase:-
- HMaster
- Region Server
- Region
- Zookeeper
HMaster: It is the master server in the HBase architecture. It monitors all Region Servers and is the interface for all metadata changes. It typically runs on the same machine as the NameNode.
Region Servers: When a Region Server receives read and write requests from a client, it assigns each request to a specific region, where the actual column family resides. The client can contact Region Servers directly; it does not need permission from HMaster to communicate with them. The client needs HMaster only for operations related to metadata and schema changes.
Regions: Regions are the basic building blocks of an HBase cluster. Tables are distributed across the cluster as regions, and each region is made up of column families. A region contains multiple stores, one for each column family, and each store consists mainly of two components: the MemStore and HFiles.
ZooKeeper: In HBase, ZooKeeper is a centralized coordination service that maintains configuration information and provides distributed synchronization, i.e. coordination between the nodes of the distributed applications running across the cluster. If a client wants to communicate with regions, it has to approach ZooKeeper first.
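As a rough illustration of how a request can be matched to a region, the sketch below assumes that each region covers a contiguous range of row keys and is identified by its start key. The function `find_region` and the sample start keys are hypothetical; real HBase resolves regions through its own metadata lookup.

```python
from bisect import bisect_right


def find_region(start_keys, row_key):
    """Return the index of the region whose key range contains row_key.

    start_keys must be sorted, and the first entry must be the empty
    string so that the first region covers every key from the beginning.
    """
    return bisect_right(start_keys, row_key) - 1


# Three hypothetical regions: ["", "row2"), ["row2", "row5"), ["row5", end).
region_start_keys = ["", "row2", "row5"]

first = find_region(region_start_keys, "row1")
second = find_region(region_start_keys, "row2")
third = find_region(region_start_keys, "row9")
```

A binary search over sorted start keys is used here because it keeps the lookup cheap even when a table has many regions.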
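The interplay of MemStore and HFile can be sketched with a toy store. This is a simplified model under stated assumptions: `ToyStore` and `flush_threshold` are hypothetical names, the flush is triggered by entry count rather than memory size, and the write-ahead log and compaction of real HBase are left out.

```python
class ToyStore:
    """Toy model of one store: an in-memory MemStore plus flushed HFiles."""

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold
        self.memstore = {}   # recent writes, held in memory
        self.hfiles = []     # flushed files, oldest first, never modified

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the MemStore contents out as one key-sorted file, then clear it.
        if self.memstore:
            self.hfiles.append(dict(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key):
        # The newest data wins: check the MemStore, then files newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            if key in hfile:
                return hfile[key]
        return None


store = ToyStore(flush_threshold=2)
store.put("row1", "value1")
store.put("row2", "value2")      # reaches the threshold and triggers a flush
store.put("row1", "value1-new")  # newer value stays in the MemStore for now
```

After these calls, a read of `row1` is answered from the MemStore, while `row2` is found in the flushed file.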
HBase Architecture:- HBase is a part of Hadoop’s Ecosystem.
In-Depth Model:-
Head to Head Comparison Between HDFS and HBase (Infographics)
Below are the top 14 comparisons between HDFS and HBase:
Key Differences Between HDFS and HBase
The key differences between HDFS and HBase are as follows:
- HDFS is a distributed file system that is well suited for the storage of large files. But HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables.
- HDFS is based on the Google File System (GFS). HBase is a distributed, column-oriented, multi-dimensional (versioned) storage system that uses HDFS for storage.
- Data in HDFS can be queried with Hive, which runs on top of HDFS and provides the Hive Query Language (HQL). HBase, by contrast, is not a SQL database: no joins, no query engine, no data types, no SQL, no fixed schema, and no DBA needed.
- HDFS is a distributed storage layer, so it has no query language of its own. It is operated through UNIX-like file commands (newer Hadoop releases prefer the hdfs dfs form over the deprecated hadoop dfs), for example:
- hadoop dfs -mkdir /foodir
- hadoop dfs -cat /foodir/myfile.txt
- hadoop dfs -rm /foodir/myfile.txt
HBase, on the other hand, has its own interface in the form of the HBase Shell, for example:
- hbase(main):003:0> create 'test', 'cf'
0 row(s) in 1.2200 seconds
- hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0560 seconds
- hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0370 seconds
- hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0450 seconds
- hbase(main):007:0> scan 'test'
ROW COLUMN+CELL
row1 column=cf:a, timestamp=1288380727188, value=value1
row2 column=cf:b, timestamp=1288380738440, value=value2
row3 column=cf:c, timestamp=1288380747365, value=value3
3 row(s) in 0.0590 seconds
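The scan output shows that every cell is addressed by row, column, and timestamp, which is what makes the data model multi-dimensional. A minimal sketch of that idea follows; `ToyTable` is a hypothetical in-memory stand-in rather than the HBase client API, and the timestamp of the second write to row1 is made up for illustration.

```python
class ToyTable:
    """Toy versioned cell store: (row, column) -> {timestamp: value}."""

    def __init__(self):
        self.cells = {}

    def put(self, row, column, value, timestamp):
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        versions = self.cells.get((row, column))
        if not versions:
            return None
        if timestamp is None:
            # Without an explicit timestamp the newest version is returned.
            return versions[max(versions)]
        return versions.get(timestamp)

    def scan(self):
        # Yield the newest version of every cell, ordered by row then column.
        for row, column in sorted(self.cells):
            versions = self.cells[(row, column)]
            newest = max(versions)
            yield row, column, newest, versions[newest]


table = ToyTable()
table.put("row1", "cf:a", "value1", 1288380727188)
table.put("row2", "cf:b", "value2", 1288380738440)
# Hypothetical later write to the same cell; the old version is kept.
table.put("row1", "cf:a", "value1-new", 1288380727189)

rows = list(table.scan())
```

An update here does not overwrite the old value; it adds a newer version, and the older one stays readable by its timestamp.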
HDFS and HBase Comparison Table
Following is the comparison table between HDFS and HBase:
Basis for Comparison | HDFS | HBase |
Why we need them | Need to process huge datasets on large clusters of computers | HBase is a distributed column-oriented data store built on top of HDFS |
Nodes fail every day | a) Failure is expected, rather than exceptional b) The number of nodes in a cluster is not constant | HBase is an Apache open source project whose goal is to provide storage for Hadoop distributed computing |
Write Pattern | Append Only | Random write, bulk incremental |
Read Pattern | Full table scan, partition table scan | Random read, small range scan or table scan |
W/R Pattern | HDFS is ideally suited for write-once and read-many times use cases | HBase is ideally suited for random write and read of data that is stored in HDFS. |
Hive(SQL) Performance | Relatively very good | 4-5 times slower |
Structured Storage | Do it yourself or TSV or Sequence File | Sparse column family data model |
Maximum Data Size | Can typically store around 30 PB | Approximately 1 PB |
Dynamic Changes | HDFS has a rigid architecture that does not allow changes. It doesn’t facilitate dynamic storage. | HBase allows for dynamic changes and can be utilized for standalone applications. |
Data Distribution | Data is stored in a distributed manner across the nodes in a cluster. Data is divided into blocks and is then stored over nodes present in HDFS cluster. | Tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows |
Data Storage | Files are split into blocks, with a default block size of 64 MB (128 MB in newer versions) | All the data is stored in the form of tables, rows, and columns |
Data Modeling | Data in HDFS is typically processed with MapReduce, which treats file contents as key-value pairs | HBase is based on Google’s Bigtable model, which uses key-value pairs as well |
Operations | It has high latency operations | It has low latency operations |
Accessibility | It is primarily accessed through MR (Map Reduce) jobs | It can be accessed through shell commands, client API in Java, REST, Avro or Thrift |
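To make the block sizes in the Data Storage row concrete, the arithmetic for how many blocks a file occupies can be sketched as follows. The helper `block_count` and the sample file sizes are hypothetical; only the 64 MB and 128 MB block sizes come from the table above.

```python
MB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes=128 * MB):
    """Number of blocks needed for a file (ceiling division)."""
    return -(-file_size_bytes // block_size_bytes)


def last_block_size(file_size_bytes, block_size_bytes=128 * MB):
    """Size of the final block, which may be smaller than a full block."""
    if file_size_bytes == 0:
        return 0
    remainder = file_size_bytes % block_size_bytes
    return remainder if remainder else block_size_bytes


one_gb = 1024 * MB
blocks_new = block_count(one_gb)                # 128 MB blocks
blocks_old = block_count(one_gb, 64 * MB)       # 64 MB blocks
blocks_plus_one = block_count(one_gb + 1)       # one extra byte needs one more block
tail = last_block_size(one_gb + 1)
```

With these inputs a 1 GB file occupies 8 blocks at 128 MB and 16 blocks at 64 MB, and a single extra byte adds a ninth block that holds only that byte.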
Conclusion
In conclusion, both HDFS and HBase are valuable technologies in their own right. Both were created to store Big Data and to make it easy to access and compute on. They go side by side: HDFS stores the data, while HBase puts a structure on it that determines how it is stored and later retrieved for the client.
HBase is one of the NoSQL column-oriented distributed databases available from the Apache Software Foundation. HBase performs better than plain Hadoop or Hive when retrieving a small number of records. Looking up a record by its row key is fast because HBase indexes rows by key, and it supports updates and row-level atomic operations.
We can perform online real-time analytics using HBase integrated with the Hadoop ecosystem. It offers automatic and configurable sharding of tables, provides a RESTful API, and can serve as a source or sink for MapReduce jobs.