@vinothchandar
Last active December 10, 2024 17:33
Spark SQL Plans on Amazon Reviews Dataset

Dataset

https://s3.amazonaws.com/amazon-reviews-pds/readme.html

scala> spark.sql("select count(*) from amazon_reviews").show
+---------+
| count(1)|
+---------+
|160796570|
+---------+


scala> spark.sql("select count(*) from amazon_reviews where review_date > '2007' and review_date < '2009'").show
+--------+
|count(1)|
+--------+
| 4683044|
+--------+


scala>

Query with no filter

select sum(total_votes), product_category from amazon_reviews group by product_category 

takes 15 seconds.

== Parsed Logical Plan ==
GlobalLimit 1001
+- LocalLimit 1001
   +- Aggregate [product_category#332], [sum(cast(total_votes#325 as bigint)) AS sum(total_votes)#434L, product_category#332]
      +- SubqueryAlias `amazon_reviews`
         +- Relation[marketplace#317,customer_id#318,review_id#319,product_id#320,product_parent#321,product_title#322,star_rating#323,helpful_votes#324,total_votes#325,vine#326,verified_purchase#327,review_headline#328,review_body#329,review_date#330,year#331,product_category#332] parquet

== Analyzed Logical Plan ==
sum(total_votes): bigint, product_category: string
GlobalLimit 1001
+- LocalLimit 1001
   +- Aggregate [product_category#332], [sum(cast(total_votes#325 as bigint)) AS sum(total_votes)#434L, product_category#332]
      +- SubqueryAlias `amazon_reviews`
         +- Relation[marketplace#317,customer_id#318,review_id#319,product_id#320,product_parent#321,product_title#322,star_rating#323,helpful_votes#324,total_votes#325,vine#326,verified_purchase#327,review_headline#328,review_body#329,review_date#330,year#331,product_category#332] parquet

== Optimized Logical Plan ==
GlobalLimit 1001
+- LocalLimit 1001
   +- Aggregate [product_category#332], [sum(cast(total_votes#325 as bigint)) AS sum(total_votes)#434L, product_category#332]
      +- Project [total_votes#325, product_category#332]
         +- Relation[marketplace#317,customer_id#318,review_id#319,product_id#320,product_parent#321,product_title#322,star_rating#323,helpful_votes#324,total_votes#325,vine#326,verified_purchase#327,review_headline#328,review_body#329,review_date#330,year#331,product_category#332] parquet

== Physical Plan ==
CollectLimit 1001
+- *(2) HashAggregate(keys=[product_category#332], functions=[sum(cast(total_votes#325 as bigint))], output=[sum(total_votes)#434L, product_category#332])
   +- Exchange hashpartitioning(product_category#332, 200)
      +- *(1) HashAggregate(keys=[product_category#332], functions=[partial_sum(cast(total_votes#325 as bigint))], output=[product_category#332, sum#440L])
         +- *(1) FileScan parquet [total_votes#325,product_category#332] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Volumes/HUDIDATA/input-data/amazon-reviews], PartitionCount: 43, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<total_votes:int>


@vinothchandar commented Jul 18, 2020

select sum(total_votes), product_category from amazon_reviews where review_date > '2007' and review_date < '2009' group by product_category 

takes 17 seconds.
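
To check whether the review_date predicate is actually pushed down to the Parquet scan, the plan for this query can be printed directly from spark-shell; a minimal sketch using the same table:

// Extended explain prints the parsed/analyzed/optimized/physical plans; the PushedFilters
// entry on the FileScan node shows whether the review_date predicate reaches the Parquet reader.
spark.sql("""
  select sum(total_votes), product_category
  from amazon_reviews
  where review_date > '2007' and review_date < '2009'
  group by product_category
""").explain(true)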


@vinothchandar commented Jul 18, 2020

Next, let's sort by review_date and ask the same question (no partitions).

takes 10 seconds.

val outputPath = "file:///Volumes/HUDIDATA/input-data/amazon-reviews-sorted-by-date"
df.sort("review_date").write.mode("overwrite").parquet(outputPath)
val df2 = spark.read.parquet(outputPath)
df2.registerTempTable("amazon_reviews_sorted_by_date")

select sum(total_votes), product_category from amazon_reviews_sorted_by_date where review_date > '2007' and review_date < '2009' group by product_category 
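
One way to see why the sorted copy helps: after sorting by review_date, each Parquet file covers a narrow review_date range, so file and row-group min/max statistics let most of the data be skipped for this predicate. A small sketch to inspect the per-file ranges (input_file_name() is a built-in Spark function; the path is the one written above):

import org.apache.spark.sql.functions._

spark.read.parquet("file:///Volumes/HUDIDATA/input-data/amazon-reviews-sorted-by-date")
  .withColumn("file", input_file_name())                     // physical Parquet file for each row
  .groupBy("file")
  .agg(min("review_date") as "min_date", max("review_date") as "max_date")
  .orderBy("min_date")
  .show(20, false)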


@vinothchandar commented Jul 18, 2020

Next, let's sort by product_category, review_date and ask the same question (no partitions). Let's see if the pruning efficiency holds even though we are primarily sorting by product_category. This should still pack review_dates close to each other within each category.

val outputPath = "file:///Volumes/HUDIDATA/input-data/amazon-reviews-sorted-by-category-date"
df.sort("product_category", "review_date").write.mode("overwrite").parquet(outputPath)
val df2 = spark.read.parquet(outputPath)
df2.registerTempTable("amazon_reviews_sorted_by_category_date")

select sum(total_votes), product_category from amazon_reviews_sorted_by_category_date where review_date > '2007' and review_date < '2009' group by product_category 
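
The same per-file check applies here; if the hypothesis holds, each file should contain only one or two product_category values and still a narrow review_date range. A sketch:

import org.apache.spark.sql.functions._

spark.read.parquet("file:///Volumes/HUDIDATA/input-data/amazon-reviews-sorted-by-category-date")
  .withColumn("file", input_file_name())
  .groupBy("file")
  .agg(countDistinct("product_category") as "categories",    // ideally 1-2 categories per file
       min("review_date") as "min_date", max("review_date") as "max_date")
  .show(20, false)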


@vinothchandar

Conclusion: We already lay out data by arrival order, i.e., ts (time), and that is already saving a lot of compute for temporal queries.

Further clustering by a secondary field, like product_category or city_id, will yield good gains.
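
For Hudi tables, this secondary-field layout can be maintained via clustering at write time; a hedged sketch of what that configuration could look like (option names are the standard Hudi clustering configs, and the output path here is illustrative only):

// Sketch: inline clustering that periodically rewrites files sorted by
// (product_category, review_date). Verify option names against the Hudi version in use.
df.write.format("hudi").
  option("hoodie.table.name", "amzn_reviews").
  option("hoodie.datasource.write.partitionpath.field", "product_category").
  option("hoodie.clustering.inline", "true").
  option("hoodie.clustering.inline.max.commits", "4").
  option("hoodie.clustering.plan.strategy.sort.columns", "product_category,review_date").
  mode("append").
  save("file:///Volumes/HUDIDATA/output-data/amazon_reviews_hudi_clustered")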

@FelixKJose

@vinothchandar great. When you say clustering by secondary fields, does that mean sort(ts, product_category, city_id)? How about Z-Ordering? Does Hudi plan to have anything like Z-Ordering?

@vinothchandar

Clustering is similar to Z-Ordering. You can see the RFC here.

https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+speed+and+query+performance

Please keep the discussion on the mailing list, so everyone else can chime in as well!

@vinothchandar commented Dec 10, 2024

Creating tables from input data

spark-shell \
  --jars $JAR_LOCATION --executor-memory 16g --driver-memory 16g \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar'

val df = spark.read.format("parquet").load("file:///.../amazon-reviews-1000-parts")
df.write.format("hudi").
  option("hoodie.datasource.write.partitionpath.field", "product_category").
  option("hoodie.table.name", "amzn_reviews").
  save("file:////.../amazon_reviews_hudi_ds_v100")
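
A quick way to confirm the write produced the expected partition layout (a sketch; the elided path is the same one passed to save above):

import org.apache.spark.sql.functions._

// Read the Hudi table back and count rows per product_category partition.
val hudiDf = spark.read.format("hudi").load("file:////.../amazon_reviews_hudi_ds_v100")
hudiDf.groupBy("product_category").count().orderBy(desc("count")).show(50, false)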

@vinothchandar commented Dec 10, 2024

1.0 - query using index on review_date

spark-sql (default)> create table hudi_table_external using hudi location 'file:///Volumes/HUDIDATA/output-data/amazon_reviews_hudi_ds_v100';
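
The session settings that let Hudi use the metadata table and its column stats for data skipping on review_date are not shown here; a hedged sketch of the usual configs (names per the Hudi docs, so treat them as assumptions), equivalently issued through spark.sql from spark-shell:

// Sketch: enable metadata-table based data skipping for this session. The warning
// printed in the "Without index" run below lists exactly these kinds of flags.
spark.sql("set hoodie.metadata.enable=true")
spark.sql("set hoodie.enable.data.skipping=true")
spark.sql("set hoodie.metadata.index.column.stats.enable=true")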

spark-sql (default)> describe table hudi_table_external;
_hoodie_commit_time 	string
_hoodie_commit_seqno	string
_hoodie_record_key  	string
_hoodie_partition_path	string
_hoodie_file_name   	string
marketplace         	string
customer_id         	string
review_id           	string
product_id          	string
product_parent      	string
product_title       	string
star_rating         	int
helpful_votes       	int
total_votes         	int
vine                	string
verified_purchase   	string
review_headline     	string
review_body         	string
review_date         	date
year                	int
product_category    	string
# Partition Information
# col_name          	data_type           	comment
product_category    	string


-- verify partition pruning works 
spark-sql (default)> select count(*) as cnt from hudi_table_external where product_category = "Books";
20726160
Time taken: 12.774 seconds, Fetched 1 row(s)
spark-sql (default)> select count(*) as cnt from hudi_table_external where product_category = "Books" or product_category = "Home";
26954727
Time taken: 16.142 seconds, Fetched 1 row(s)
spark-sql (default)> select count(*) as cnt from hudi_table_external where product_category = "Home";
6228567
Time taken: 3.918 seconds, Fetched 1 row(s)
spark-sql (default)>

Without index

spark-sql (default)> select count(*) as cnt, customer_id from hudi_table_external group by customer_id order by cnt desc limit 100;

3155	53075795
3152	52042479
3090	19380211
3089	32038204
3072	52048190
3052	36008820
3046	50027179
3043	37337835
3019	52228204
2987	33631101
2978	36952575
2974	17407953
2943	49939297
2923	52562189
2914	52500542
2899	45473710
2897	53084107
2884	52824002
2873	52215985
2867	25527589
2829	49447323

spark-sql (default)> select count(*) as cnt, customer_id from hudi_table_external where customer_id in ("36952575","52500542","49447323") group by customer_id;
24/12/10 06:58:14 WARN HoodieFileIndex: Data skipping requires Metadata Table and at least one of the indices to be enabled! (isMetadataTableEnabled = true, isColumnStatsIndexEnabled = false, isRecordIndexApplicable = false, isExpressionIndexEnabled = false, isBucketIndexEnable = false, isPartitionStatsIndexEnabled = false), isBloomFiltersIndexEnabled = false)
2829	49447323
2914	52500542
2978	36952575
Time taken: 151.666 seconds, Fetched 3 row(s)


With index
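
The index creation step between the two runs is not shown; with Hudi 1.0, a secondary index on customer_id can be declared through SQL. A hedged sketch (exact syntax per the Hudi 1.0 docs, so treat it as an assumption; the index name is made up):

// Sketch: build a secondary index on customer_id in the metadata table so the
// IN-list lookup below can prune files that contain none of the requested keys.
spark.sql("create index idx_customer_id on hudi_table_external using secondary_index(customer_id)")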

spark-sql (default)> select count(*) as cnt, customer_id from hudi_table_external where customer_id in ("36952575","52500542","49447323") group by customer_id;
2829	49447323
2914	52500542
2978	36952575
Time taken: 152.59 seconds, Fetched 3 row(s)
spark-sql (default)>



@vinothchandar

Lookup of a single customer_id
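
A sketch of the point lookup being measured, using one of the customer_id values from the top-customers list above:

// Single-key lookup; with a secondary index on customer_id this should only read
// the files that actually contain the key.
spark.sql("select * from hudi_table_external where customer_id = '53075795'").show(false)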

