Chapter 1
Apache Hive
Hadoop provides reliable, scalable, distributed computing, but the learning curve for extracting data from it is often too steep to be time effective or cost efficient.
What is Apache Hive?
Apache Hive is popular data warehouse software that lets you write SQL-like queries to extract data from Apache Hadoop quickly and efficiently.
Hadoop is an open-source framework for storing and processing massive amounts of data. While Hadoop offers many advantages over traditional relational databases, learning and using it is daunting, because a query that would take a few lines of SQL must instead be implemented against the MapReduce Java API.
To resolve this issue, Facebook developed Hive, now an Apache project, so that it could bypass writing Java and access data using simple SQL-like queries.
Today, Apache Hive's SQL-like interface is a widely used standard for making ad-hoc queries, summarizing, and analyzing Hadoop data. This solution is particularly cost effective and scalable when deployed in cloud environments, which is why many companies, such as Netflix and Amazon, continue to develop and improve Apache Hive.
The predominant use cases for Apache Hive are batch SQL queries over large data sets and batch processing of large ETL and ELT jobs.
How Does Apache Hive Work?
In short, Apache Hive translates an input program written in HiveQL (a SQL-like language) into one or more Java MapReduce, Tez, or Spark jobs. (All of these execution engines can run on Hadoop YARN.) Apache Hive then organizes the data into tables for the Hadoop Distributed File System (HDFS) and runs the jobs on a cluster to produce an answer.
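To make the translation described above more concrete, the following is a minimal Python sketch of how a query such as SELECT page, COUNT(*) FROM visits GROUP BY page could be expressed as map, shuffle, and reduce steps. It is a toy, single-process simulation and not Hive's actual compiler output; the visits table, its columns, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of a table visits(user_id, page).
ROWS = [
    ("u1", "/home"),
    ("u2", "/home"),
    ("u1", "/cart"),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _user_id, page in rows:
        yield page, 1


def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key to produce COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


def run_query(rows):
    """Chain the three phases the way a single MapReduce job would."""
    return reduce_phase(shuffle(map_phase(rows)))
```

Calling run_query(ROWS) returns a dictionary of per-page counts. On a real cluster, the map and reduce phases would run in parallel across many nodes, which is the part that Hive and the execution engine handle for you.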
Apache Hive Data
The Apache Hive tables are similar to tables in a relational database, and data units are organized from larger to more granular units. Databases consist of tables that are made up of partitions, which can further be broken down into buckets. The data is accessed through HiveQL (Hive Query Language) and can be overwritten or appended. Within each database, table data is serialized, and each table has a corresponding HDFS directory.
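The hierarchy of databases, tables, partitions, and buckets maps onto HDFS directories and files. The sketch below builds an illustrative path for one bucket file, assuming the default warehouse root, a non-default database, and the plain (non-transactional) file layout. The hash used for bucket assignment is a stand-in chosen for the example and is not Hive's own bucketing function; the database, table, and column names are hypothetical.

```python
import zlib

# Hive's default warehouse location; it is configurable in practice.
WAREHOUSE_ROOT = "/user/hive/warehouse"


def bucket_for(value, num_buckets):
    """Assign a value to a bucket with hash(value) mod num_buckets.

    Assumption: CRC32 is an illustrative stand-in, not Hive's real hash.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def data_file_path(database, table, partition, bucket_value, num_buckets):
    """Build the path database -> table -> partition(s) -> bucket file."""
    # Each partition column becomes a "column=value" directory level.
    part_dirs = "/".join(f"{col}={val}" for col, val in partition.items())
    bucket = bucket_for(bucket_value, num_buckets)
    return f"{WAREHOUSE_ROOT}/{database}.db/{table}/{part_dirs}/{bucket:06d}_0"
```

For example, data_file_path("sales", "orders", {"order_date": "2024-01-15"}, 42, 4) yields a path under /user/hive/warehouse/sales.db/orders/order_date=2024-01-15/. Because each partition is its own directory, a query that filters on the partition column only needs to read the matching directories.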
Apache Hive Architecture
Multiple interfaces are available, from a web browser UI, to a CLI, to external clients. The Apache Hive Thrift server enables remote clients to submit commands and requests to Apache Hive using a variety of programming languages. The central repository for Apache Hive is a metastore that contains all metadata, such as table definitions.
The engine that makes Apache Hive work is the driver, which consists of a compiler, an optimizer to determine the best execution plan, and an executor. Optionally, Apache Hive can be run with LLAP (Live Long and Process), a layer of long-running daemons that cache data to speed up queries. Note that for high availability, you can configure a backup of the metadata.
Apache Hive Security
Apache Hive is integrated with Hadoop security, which uses Kerberos for mutual authentication between client and server. Permissions for newly created files in Apache Hive are dictated by HDFS, which enables you to authorize by user, group, and others.
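The user, group, and others model works much like POSIX file permissions. As a rough illustration, the following sketch decides which permission bits apply to a requesting user. It deliberately ignores the superuser and extended ACLs, and the function and account names are hypothetical.

```python
def effective_permissions(mode, owner, group, user, user_groups):
    """Return the read/write/execute rights that apply to `user`.

    mode is an octal permission value such as 0o750.
    Simplification: superuser and ACL entries are not modeled.
    """
    if user == owner:
        bits = (mode >> 6) & 0o7   # owner triple
    elif group in user_groups:
        bits = (mode >> 3) & 0o7   # group triple
    else:
        bits = mode & 0o7          # others triple
    return {
        "read": bool(bits & 0o4),
        "write": bool(bits & 0o2),
        "execute": bool(bits & 0o1),
    }
```

With a mode of 0o750 on a table directory owned by a hypothetical hive user and analysts group, the owner gets full access, members of analysts can read and list but not write, and everyone else is denied.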
Benefits of Apache Hive
Apache Hive is ideal for running end-of-day reports, reviewing daily transactions, making ad-hoc queries, and performing data analysis. The insights that Apache Hive makes available can provide a significant competitive advantage and make it easier for you to react to market demands.
The following are a few of the benefits that make such insights readily available:
Ease of use: Querying data is easy to learn thanks to the SQL-like language.
Accelerated initial insertion of data: Apache Hive applies the schema when data is read rather than when it is written, so data does not have to be read, parsed, and serialized to disk in the database's internal format at load time. Compare this to a traditional database, where data must be verified each time it is inserted.
Superior scalability, flexibility, and cost efficiency: Because Apache Hive stores data in HDFS, it can hold hundreds of petabytes, making it a much more scalable solution than a traditional database. As a cloud-based Hadoop service, Apache Hive enables users to rapidly spin virtual servers up or down to accommodate fluctuating workloads.
Streamlined security: Critical workloads can be replicated for disaster recovery.
Low overhead: Insert-only tables have near-zero overhead. Since no renaming is required, the solution is cloud friendly.
Exceptional working capacity: Apache Hive can support up to 100,000 queries per hour on huge datasets.
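The fast initial insertion listed above comes from applying the schema at read time. The sketch below illustrates that idea under simplifying assumptions: loading stores the raw lines untouched, and the schema is applied only when rows are read, with fields that are missing or fail to parse returned as None (Hive returns NULL in comparable cases). The sample data, schema, and function names are hypothetical.

```python
# Hypothetical raw, comma-delimited lines as they might sit in a file.
RAW_LINES = ["1,alice,30", "2,bob,not_a_number", "3,carol"]

# Column names and types, applied only when the data is read.
SCHEMA = [("id", int), ("name", str), ("age", int)]


def load(lines):
    """'Load' step: keep the lines as they are, with no parsing or checks."""
    return list(lines)


def read_with_schema(stored, schema, delimiter=","):
    """Apply the schema at read time; bad or missing fields become None."""
    for line in stored:
        fields = line.split(delimiter)
        row = {}
        for i, (col, cast) in enumerate(schema):
            try:
                row[col] = cast(fields[i])
            except (IndexError, ValueError):
                row[col] = None
        yield row
```

Here load never fails, even though two of the three lines do not fit the schema. The cost of interpreting the data is paid when read_with_schema runs, which is the trade-off behind fast loads.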
