samedi 25 avril 2015

SQL/HQL Count Total Not Working on Join


I'm creating a recommender using some simple probability formulas. I'm working with the Million Song Dataset large dataset (189M rows and 3 columns) with my Hadoop cluster, using Hive. Here is a sample of the initial dataset:

million_song - initial dataset
 user   song    play_count
 c3fb2  SOXOQ   18
 c3fb2  SOZVC   1
 6041e  SOBHN   3
 6041e  SOBXT   5

But first I created a table from the initial dataset, which a sample is show here:

million_both - adds up occurrence of two songs listened by same user
 driver_song_id     also_song_id    play_count
 SOXOQ              SOBXT           1642
 SOBHN              SOBXT           2168
 SOBXT              SOZVC           1742

I'm trying to add these two tables above together, and the issue is I can't get the correct count_n_users, which should be the same # for each row in the table. Here is my code:
 select a.driver_song_id, a.also_song_id, a.play_count, 
 count(distinct b.user), sum(distinct b.user) as count_n_users 
 from (select driver_song_id, also_song_id, play_count from million_both) as a 
 join (select user, song, play_count from expp_team.million_song) as b 
 on a.driver_song_id = b.song
 group by a.driver_song_id, a.also_song_id, a.play_count;

The issue I cannot get the correct count_n_users, as of right now it's overstating the value.

This is what I'd like the output to look like:

 driver_song_id     also_song_id    play_count  c(b.user)  count_n_users
 SOXOQ              SOBXT           1643        463         139,738,054           
 SOBHN              SOBXT           2168        483         139,738,054
 SOBXT              SOZVC           1742        725         139,738,054


Aucun commentaire:

Enregistrer un commentaire