https://s3.amazonaws.com/amazon-reviews-pds/readme.html
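The queries below run against an `amazon_reviews` table. A minimal setup sketch that registers it (the local path is taken from the FileScan location in the plan further down; the builder options are assumptions):

```scala
// Sketch: read the partitioned Parquet copy of the Amazon Customer
// Reviews dataset and expose it to Spark SQL as "amazon_reviews".
// The path matches the FileScan location shown in the plan below.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("amazon-reviews")   // hypothetical app name
  .getOrCreate()

spark.read
  .parquet("file:/Volumes/HUDIDATA/input-data/amazon-reviews")
  .createOrReplaceTempView("amazon_reviews")
```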
scala> spark.sql("select count(*) from amazon_reviews").show
+---------+
| count(1)|
+---------+
|160796570|
+---------+
scala> spark.sql("select count(*) from amazon_reviews where review_date > '2007' and review_date < '2009'").show
+--------+
|count(1)|
+--------+
| 4683044|
+--------+
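Note that `review_date` is compared as a string here; lexicographic ordering happens to work for ISO-formatted dates (`'2007' < '2007-01-01'` and `'2008-12-31' < '2009'`), so the range covers all of 2007 and 2008. A more explicit equivalent, as a sketch (it assumes `review_date` parses cleanly with `to_date`):

```scala
// Same count expressed with an explicit year() filter instead of
// relying on lexicographic string comparison of review_date.
spark.sql("""
  select count(*)
  from amazon_reviews
  where year(to_date(review_date)) in (2007, 2008)
""").show()
```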
scala> spark.sql("select sum(total_votes), product_category from amazon_reviews group by product_category").show
takes 15 seconds.
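The parsed, analyzed, optimized, and physical plans that follow can be printed with `explain(true)`; a sketch:

```scala
// Print all four plan stages (parsed, analyzed, optimized, physical)
// for the group-by aggregation, matching the output below.
spark.sql("""
  select sum(total_votes), product_category
  from amazon_reviews
  group by product_category
""").explain(true)
```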
== Parsed Logical Plan ==
GlobalLimit 1001
+- LocalLimit 1001
+- Aggregate [product_category#332], [sum(cast(total_votes#325 as bigint)) AS sum(total_votes)#434L, product_category#332]
+- SubqueryAlias `amazon_reviews`
+- Relation[marketplace#317,customer_id#318,review_id#319,product_id#320,product_parent#321,product_title#322,star_rating#323,helpful_votes#324,total_votes#325,vine#326,verified_purchase#327,review_headline#328,review_body#329,review_date#330,year#331,product_category#332] parquet
== Analyzed Logical Plan ==
sum(total_votes): bigint, product_category: string
GlobalLimit 1001
+- LocalLimit 1001
+- Aggregate [product_category#332], [sum(cast(total_votes#325 as bigint)) AS sum(total_votes)#434L, product_category#332]
+- SubqueryAlias `amazon_reviews`
+- Relation[marketplace#317,customer_id#318,review_id#319,product_id#320,product_parent#321,product_title#322,star_rating#323,helpful_votes#324,total_votes#325,vine#326,verified_purchase#327,review_headline#328,review_body#329,review_date#330,year#331,product_category#332] parquet
== Optimized Logical Plan ==
GlobalLimit 1001
+- LocalLimit 1001
+- Aggregate [product_category#332], [sum(cast(total_votes#325 as bigint)) AS sum(total_votes)#434L, product_category#332]
+- Project [total_votes#325, product_category#332]
+- Relation[marketplace#317,customer_id#318,review_id#319,product_id#320,product_parent#321,product_title#322,star_rating#323,helpful_votes#324,total_votes#325,vine#326,verified_purchase#327,review_headline#328,review_body#329,review_date#330,year#331,product_category#332] parquet
== Physical Plan ==
CollectLimit 1001
+- *(2) HashAggregate(keys=[product_category#332], functions=[sum(cast(total_votes#325 as bigint))], output=[sum(total_votes)#434L, product_category#332])
+- Exchange hashpartitioning(product_category#332, 200)
+- *(1) HashAggregate(keys=[product_category#332], functions=[partial_sum(cast(total_votes#325 as bigint))], output=[product_category#332, sum#440L])
+- *(1) FileScan parquet [total_votes#325,product_category#332] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Volumes/HUDIDATA/input-data/amazon-reviews], PartitionCount: 43, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<total_votes:int>
takes 17 seconds.
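The `Exchange hashpartitioning(product_category#332, 200)` step shuffles into the default 200 partitions, which is wide for a group-by over only about 43 product categories. Lowering `spark.sql.shuffle.partitions` is one common knob to try; a sketch (the value 50 is an arbitrary assumption, not a measured tuning):

```scala
// Narrow the shuffle for a small group-by cardinality: with ~43
// product_category values, far fewer than 200 partitions are needed.
spark.conf.set("spark.sql.shuffle.partitions", "50")  // assumed value

spark.sql("""
  select sum(total_votes), product_category
  from amazon_reviews
  group by product_category
""").show()
```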