Modern Big Data Processing with Hadoop

Which file format is better?

The answer is: it depends on your use case. Generally, the criteria for selecting a file format are read and write performance for your queries. The choice also depends on which Hadoop distribution you are using. The ORC file format is generally the best fit for Hive and Tez on the Hortonworks distribution, and Parquet is recommended for Cloudera Impala implementations. For use cases involving schema evolution, Avro files are best suited. If you want to import data from an RDBMS using Sqoop, the text/CSV file format is the better choice. For storing intermediate map output, a sequence file is the usual choice.
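The recommendations above can be summarized as a simple lookup. The following is only a minimal sketch in Python: the use-case keys and the `recommend_format` function are hypothetical names introduced for illustration, and the table encodes nothing beyond the guidance in the preceding paragraph.

```python
# Hypothetical lookup table: the keys are illustrative labels, not Hadoop
# settings; the values are the formats recommended in the text above.
RECOMMENDED_FORMAT = {
    "hive_tez": "ORC",                          # Hive and Tez on Hortonworks
    "impala": "Parquet",                        # Cloudera Impala
    "schema_evolution": "Avro",                 # schemas that change over time
    "sqoop_import": "Text/CSV",                 # importing from an RDBMS with Sqoop
    "map_intermediate_output": "SequenceFile",  # intermediate map output
}


def recommend_format(use_case: str) -> str:
    """Return the file format the text suggests for the given use case."""
    key = use_case.strip().lower()
    if key not in RECOMMENDED_FORMAT:
        known = ", ".join(sorted(RECOMMENDED_FORMAT))
        raise ValueError(f"Unknown use case {use_case!r}; expected one of: {known}")
    return RECOMMENDED_FORMAT[key]


if __name__ == "__main__":
    for case in RECOMMENDED_FORMAT:
        print(f"{case}: {recommend_format(case)}")
```

In practice, a table like this is only a starting point; it is worth benchmarking read and write performance on your own data and distribution before settling on a format.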