HCatalog adds a lot of functionality when we want to reuse schemas across processing tools (Pig, MapReduce). It greatly simplifies data consumption by exposing the Hive metastore, which records where each table's data lives and how it is structured. There is great documentation available on using HCatalog. Now, let's look at this scenario: one developer created an external table in Hive, and another wants to use it in Pig.
For example, a Pig developer wants to use HiveSampleTable, a sample table that is created when you create an HDInsight cluster. Without HCatalog, the Pig developer would have to know where the data is stored and how it is structured. First things first: she needs to instruct Pig to use HCatalog by passing this switch when Pig is called:
C:\apps\dist\hadoop-2.4.0.2.1.15.1-1234>%pig_home%\bin\pig -useHCatalog
The -useHCatalog switch makes the HCatalog jars available to Pig, which allows it to leverage HCatalog.
Then, she can declare a variable that points to the HiveSampleTable in Hive:
grunt> SampleTable = LOAD 'HiveSampleTable' USING org.apache.hive.hcatalog.pig.HCatLoader();
2015-09-11 23:01:59,308 [main] INFO hive.metastore - Trying to connect to metastore with URI thrift://headnodehost:9083
2015-09-11 23:01:59,391 [main] INFO hive.metastore - Connected to metastore.
2015-09-11 23:02:00,058 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
Now, if she calls a describe command, here is what she gets:
grunt> DESCRIBE SampleTable;
2015-09-11 23:03:58,829 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
SampleTable: {clientid: chararray,querytime: chararray,market: chararray,deviceplatform: chararray,devicemake: chararray,devicemodel: chararray,state
le,sessionid: long,sessionpagevieworder: long}
We can clearly see that Pig is leveraging the HCatalog metastore: it has recognized the underlying file and its structure. From here on, we can work with this variable as usual, without any knowledge of the file location or schema.
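As an illustration, here is a minimal sketch of what she could do next with the variable. It assumes Pig was started with -useHCatalog and that SampleTable was loaded as shown earlier. The column names come from the DESCRIBE output, while the relation names (UsMarket, ByPlatform, PlatformCounts) and the 'en-US' market value are made up for the example.

```pig
-- Keep only the rows for one market; 'en-US' is an assumed sample value.
UsMarket = FILTER SampleTable BY market == 'en-US';

-- Group the remaining rows by device platform.
ByPlatform = GROUP UsMarket BY deviceplatform;

-- Count all rows in each group.
PlatformCounts = FOREACH ByPlatform GENERATE group AS deviceplatform, COUNT_STAR(UsMarket) AS hits;

-- Print the result to the console.
DUMP PlatformCounts;
```

Note that nothing in this script mentions a file path, a delimiter, or a column type: all of that is resolved through HCatalog at load time.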
Happy HDInsight coding! :)