I'm creating a recommender using some simple probability formulas. I'm working with the Million Song Dataset (189M rows, 3 columns) on my Hadoop cluster, using Hive. Here is a sample of the initial dataset:
million_song - initial dataset
user song play_count
c3fb2 SOXOQ 18
c3fb2 SOZVC 1
6041e SOBHN 3
6041e SOBXT 5
First I created a derived table from the initial dataset; a sample is shown here:
million_both - counts, for each pair of songs, how many times both were listened to by the same user
driver_song_id also_song_id play_count
SOXOQ SOBXT 1642
SOBHN SOBXT 2168
SOBXT SOZVC 1742
Now I'm trying to join these two tables. The issue is that I can't get the correct count_n_users, which should be the same number on every row of the result. Here is my code:
select a.driver_song_id, a.also_song_id, a.play_count,
       count(distinct b.user), sum(distinct b.user) as count_n_users
from (select driver_song_id, also_song_id, play_count from million_both) as a
join (select user, song, play_count from expp_team.million_song) as b
  on a.driver_song_id = b.song
group by a.driver_song_id, a.also_song_id, a.play_count;
The problem is that I cannot get the correct count_n_users; right now the query overstates the value.
This is what I'd like the output to look like:
driver_song_id also_song_id play_count count(distinct b.user) count_n_users
SOXOQ SOBXT 1643 463 139,738,054
SOBHN SOBXT 2168 483 139,738,054
SOBXT SOZVC 1742 725 139,738,054
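One sketch of a possible fix, assuming count_n_users is meant to be the total number of distinct users in the whole million_song table (a single global number, identical on every row): compute that total once in its own subquery and cross join it in, so the per-group join fan-out cannot inflate it. The alias names (t, total_users, n_users_for_song) are my own, not from the original query.

```sql
-- Sketch: count_n_users computed once over the whole table,
-- then attached to every grouped row via a cross join.
select a.driver_song_id,
       a.also_song_id,
       a.play_count,
       count(distinct b.user) as n_users_for_song,
       max(t.total_users)     as count_n_users  -- constant, same on every row
from million_both a
join expp_team.million_song b
  on a.driver_song_id = b.song
cross join (select count(distinct user) as total_users
            from expp_team.million_song) t
group by a.driver_song_id, a.also_song_id, a.play_count;
```

The cross join is cheap here because the subquery returns exactly one row; max() is only there because Hive requires every selected column to be either grouped or aggregated.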