What is pig technology?
Apache Pig is an open-source technology that offers a high-level mechanism for the parallel programming of MapReduce jobs to be executed on Hadoop clusters.
1) Hive Hadoop Component is used mainly by data analysts whereas Pig Hadoop Component is generally used by Researchers and Programmers. 2) Hive Hadoop Component is used for completely structured Data whereas Pig Hadoop Component is used for semi structured data.
- Avro stores the data definition in JSON format making it easy to read and interpret, the data itself is stored in binary format making it compact and efficient. Avro files include markers that can be used to splitting large data sets into subsets suitable for Apache MapReduce™ processing.
- Apache Hive: In Hadoop the only way to process data was through a MapReduce job. Also Pig has its own language called Pig Latin which also turns your Pig Latin script to a series of MapReduce jobs. Apache Sqoop: This is a tool which is used to import RDBMS data to Hadoop.
- ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.
Apache Hive is a component of Hortonworks Data Platform(HDP). Hive provides a SQL-like interface to data stored in HDP. In the previous tutorial, we used Pig, which is a scripting language with a focus on dataflows. Hive provides a database query interface to Apache Hadoop.
- Hive is a trademark owned by Centrica Hive Limited that produces smart home devices. Using the Hive app and website, customers can control a range of internet-connected devices, from a thermostat, to lights, smart plugs, motion sensors, and window and door sensors.
- Oracle NoSQL Database is a scalable, distributed NoSQL database, designed to provide highly reliable, flexible and available data management across a configurable set of storage nodes. Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
- The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks.
Apache Pig - Architecture. The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a highlevel data processing language which provides a rich set of data types and operators to perform various operations on the data.
- e) Metastore – Metastore is the central repository of Apache Hive metadata in the Hive Architecture. It stores metadata for Hive tables (like their schema and location) and partitions in a relational database. A service that provides metastore access to other Apache Hive services.
- Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is a scalable, reliable and extensible system.
- Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley's AMPLab, and open sourced in 2010 as an Apache project.
Updated: 3rd October 2019