Pages

Atepassar Recommendations: Recommending friends with MapReduce and Python

Sunday, October 28, 2012

Hi all,

In this post I will present one of the tecnhiques used at Atépassar, a brazilian social network that help students around Brazil in order to pass the exams for a civil job, our recommender system.



I will describe some of the data models that we use and discuss our approach to algorithmic innovation that combines offline machine learning with online testing.  For this task we use distributed computing since we deal with over with 140 thousand users. MapReduce is a powerful technique and we use it by writting in python code with the framework MrJob.  I recommend you to read further about it at my last post here.

One of our recommender techniques is the simple 'people you might know' recommender algorithm.  Indeed, there are several components behind the algorithm since at Atépassar, users can follow other people  as also be followed by other people.  In this post I will talk about the basic idea of the algorithm which can be derivated for those other components.  The idea of the algorithm is that if person A and person B do know each other but they have a lot of mutual friends, then the system should recommend that they connect with each other.


Person A and Person B has common friends, it should recommend each other 

We will implement this algorithm using the MapReduce architecture in order to use Hadoop,  which is a open source software for highly reliable, scalable distributed computing.  I assume that you already familiar with those concepts, if not please take a look at those posts and see what map and reduce jobs are.


But before introducing the algoritm, let's present a simple algorithm in case of the bi-directional connections, that is, if I am your friend, you are also my friend. In order to recommend such friends, we first need to count the number of mutual friends that each pair of users have in the network. For this, we will need to implement a map reduce job that works similar to  the job that count the frequency of words in a file.  For every pair of friends in a list, we output the tuple {friend1, friend2}, 1:

class FriendsRecommender(MRJob):
def steps(self):
return [self.mr(self.map_input, self.count_number_of_friends),
self.mr(self.count_max_of_mutual_friends,
self.top_recommendations)]
def map_input(self, key, line):
'''
Compute a cartesian product using nested loops
for each friend in connection_list
Input (source -> {“friend1”, “friend2”, “friend3”}):
marcel,jonas,maria,jose,amanda
Output {[source, friend1], -1};
{[friend1, friend2], 1};):
["jonas", "marcel"] -1
["jonas", "maria"] 1
["jonas", "jose"] 1
["amanda", "jonas"] 1
["marcel", "maria"] -1
["jose", "maria"] 1
["amanda", "maria"] 1
["jose", "marcel"] -1
["amanda", "jose"] 1
["amanda", "marcel"] -1
'''
input = line.split(';')
user_id, item_ids = input[0], input[1:]
for i in range(len(item_ids)):
f1 = item_ids[i]
if user_id < f1:
yield (user_id, f1), -1
else:
yield (f1, user_id), -1
for j in range(i + 1, len(item_ids)):
f2 = item_ids[j]
if f1 < f2:
yield (f1, f2), 1
else:
yield (f2, f1), 1


In addition to this, we also output a tuple with -1 for every pair of users who are friends {user_id, friend}, -1.


Now, the reducer gets a input key denoting a pair of friends along with their list of their number of common friends.

In the reduce function, we can check if the first record has a -1 in the value. If there is such a reccord, we can ignore that pair of friends, because they already have a direct connection between them. Finally we aggregate the values of all the keys in the reducer and output the tuple for all the pair of friends which don't have third attribute as -1 in the key.

def count_number_of_friends(self, key, values):
'''
Count the number of mutual friends.
Input ({[friend1, friend2], [-1, -1]};
{[friend1, friend2],[1, 1]};):
["jonas", "marcel"] -1
["marcel", "maria"] -1
["jose", "marcel"] -1
["amanda", "marcel"] -1
["fabio", "marcel"] 1
["fabiola", "marcel"] 1
["carol", "marcel"] 1
["carol", "marcel"] 1
Output ({[friend1, friend2], numberOfMutualFriends}):
["fabio", "marcel"] 1
["fabiola", "marcel"] 1
["marcel", "patricia"] 1
["marcel", "paula"] 1
["carol", "marcel"] 2
'''
f1, f2 = key
mutual_friends_count = 0
for value in values:
if value == -1:
return
mutual_friends_count += value
yield (f1, f2), mutual_friends_count

After the first map-reduce job, we obtain a list of pair of persons along the number of common friends that they have. Our final map-reduce job looks at this list {[friend1, friend2], numberOfMutualFriends} and outputs a list of persons they have maximum number of common friends with. Our map job outputs {friend1,[numberOfMutualFriends, friend2]}, {friend2,  [numberOfMutualFriends, friend1]}  .
def count_max_of_mutual_friends(self, key, values):
'''
Prepare the dataset to yield the source and
get the top suggestions.
Input ({[friend1, friend2], numberOfMutualFriends}):
["fabio", "marcel"] 1
["fabiola", "marcel"] 1
["marcel", "patricia"] 1
["marcel", "paula"] 1
["carol", "marcel"] 2
Output ({friend1, [numberOfMutualFriends, friend2]},
{friend2, [numberOfMutualFriends, friend1]}):
"fabio", [1,"marcel"]
"marcel", [1,"fabio"]
"fabiola", [1,"marcel"]
"marcel", [1,"fabiola"]
"marcel", [1,"patricia"]
"patricia", [1,"marcel"]
"marcel", [1,"paula"]
"paula", [1,"marcel"]
"marcel", [2,"carol"]
"carol", [2,"marcel"]
'''
f1, f2 = key
# for score in values:
yield f1, (f2, int(values))
yield f2, (f1, int(values))
def top_recommendations(self, key, values):
'''
Get the TOP N recommendations for user.
Input ({friend1, [(numberOfMutualFriends, friend),
(numberOfMutualFriends2, friend)]}):
"fabio", [[1,"marcel"]]
"marcel", [[2,"carol"], [1,"fabio"], [1,"fabiola"], [1,"patricia"],
[1,"paula"]]
"fabiola", [[1,"marcel"]]
"patricia", [[1,"marcel"]]
"paula", [[1,"marcel"]]
"carol", [[2,"marcel"]]
Output ({friend1, [(numberOfMutualFriends, friend),
(numberOfMutualFriends2, friend)]}):
Ex: Get the top 3 suggestions.
"fabio", [[1,"marcel"]]
"marcel", [[2,"carol"], [1,"fabio"], [1,"fabiola"]]
"fabiola", [[1,"marcel"]]
"patricia", [[1,"marcel"]]
"paula", [[1,"marcel"]]
"carol", [[2,"marcel"]]
'''
recommendations = []
for idx, (item, score) in enumerate(values):
recommendations.append((item, score))
yield key, sorted(recommendations, key=lambda k: -k[1])[:TOP_N]

The reducer would look at the person in the key and our comparator would look at the numberOfCommonFriends to sort the keys. This would ensure that the tuples for the same person go on the same reducer and in a sorted order by the number of common friends. Our reducer then just need to look at the top 5 values and output the list (TOP_N).

Now, we run through our social network data.


marcel;jonas,maria,jose,amanda
maria;carol,fabiola,amanda,marcel
amanda;paula,patricia,maria,marcel
carol;maria,jose,patricia
fabiola;maria
paula;fabio,amanda
patricia;amanda,carol
jose;marcel,carol
jonas;marcel,fabio
fabio;jonas,paula
carla


Let's see some interesting stuff. The recommended friends to me are:


"marcel" [["carol", 2], ["fabio", 1], ["fabiola", 1], ["patricia", 1], ["paula", 1]]
"maria" [["jose", 2], ["patricia", 2], ["jonas", 1], ["paula", 1]]
"patricia" [["maria", 2], ["jose", 1], ["marcel", 1], ["paula", 1]]
"paula" [["jonas", 1], ["marcel", 1], ["maria", 1], ["patricia", 1]]
"amanda" [["carol", 2], ["fabio", 1], ["fabiola", 1], ["jonas", 1], ["jose", 1]]
"carol" [["amanda", 2], ["marcel", 2], ["fabiola", 1]]
"fabio" [["amanda", 1], ["marcel", 1]]
"fabiola" [["amanda", 1], ["carol", 1], ["marcel", 1]]
"jonas" [["amanda", 1], ["jose", 1], ["maria", 1], ["paula", 1]]
"jose" [["maria", 2], ["amanda", 1], ["jonas", 1], ["patricia", 1]]



As I expected the person with most common friends to me is carol, with maria and jose.

There are  still plenty of things that can be done here in the implementation in order to have our final recommender, for example here we are not considering the followers in common or whether if we live at the same state.  Even the results and scalability can be improved.


Facebook Data


And what about recommending friends from Facebook?  I decided to mine some data based on connections between my connections.

$python  friends_recommender.py - r emr --num-ec2-instances 5  facebook_data.csv > output.dat

Let's see some results:

Let's pick my friend Rafael Carício. The top suggestions for him would be:

"Rafael_Caricio_584129827" [["Alex_Sandro_Gomes_625299988", 28], ["Oportunidadetirecife_Otir_100002020544265", 26], ["Thiago_Diniz_784848380", 20], ["Guilherme_Barreto_735956697", 18], ["Andre_Ferraz_100001464967635", 17], ["Sofia_Galvao_Lima_1527232153", 16], ["Edmilson_Rodrigues_1003183323", 15], ["Pericles_Miranda_100001613052998", 14], ["Andre_Santos_100002368908054", 14], ["Edemilson_Dantas_100000732193812", 13]]
Rafael is my old partner and studied with me at UFPE (University). All recommended are entrepreneurs or old students from CIN/UFPE.

And about my colleague Osvaldo Santana, one of the most famous python developers in Brazil.

"Osvaldo_Santana_Neto_649598880" [["Hugo_Lopes_Tavares_100000635436030", 14], ["Francisco_Souza_100000560629656", 12], ["Daker_Fernandes_Pinheiro_100000315704652", 8], ["Flavio_Ribeiro_100000349831236", 6], ["Jinmi_Lee_1260075333", 5], ["Alex_Sandro_Gomes_625299988", 5], ["Romulo_Jales_100001734547813", 5], ["Felipe_Andrade_750803015", 5], ["Adones_Cunha_707904846", 5], ["Flavio_Junior_100000544023443", 4]]

Interesting! Recommended other Python Developers as also some entrepreneurs! :)

We could play more with it, but if you want to test with your own data download it by using this app.

Twitter Friends

Great, but how can I use this algorithm in Twitter for example ?! It's a little different. In this scenario we don't assume the bi-directional link. Only because I follow you on Twitter it does not mean that you also follow me. We have now two entities: followers and the friends. 

The basic idea is to count the number of directed paths. See the full code below:
#-*-coding: utf-8 -*-
'''
This module represents the recommender system for recommending
new friends based on 'mutual friends'.
'''
__author__ = 'Marcel Caraciolo <caraciol@gmail.com>'
from mrjob.job import MRJob
def combinations(iterable, r):
"""
Implementation of itertools combinations method. Re-implemented here because
of import issues in Amazon Elastic MapReduce. Was just easier to do this than
bootstrap.
More info here: http://docs.python.org/library/itertools.html#itertools.combinations
Input/Output:
combinations('ABCD', 2) --> AB AC AD BC BD CD
combinations(range(4), 3) --> 012 013 023 123
"""
pool = tuple(iterable)
n = len(pool)
if r > n:
return
indices = range(r)
yield tuple(pool[i] for i in indices)
while True:
for i in reversed(range(r)):
if indices[i] != i + n - r:
break
else:
return
indices[i] += 1
for j in range(i + 1, r):
indices[j] = indices[j - 1] + 1
yield tuple(pool[i] for i in indices)
TOP_N = 10
class FriendsRecommender(MRJob):
def steps(self):
return [self.mr(self.map_input, self.reverse_index),
self.mr(None, self.count_number_of_friends),
self.mr(None, self.top_recommendations)]
def map_input(self, key, line):
'''
Compute the reversed index for each connection
between the source.
Input (source -> {“friend1”, “friend2”, “friend3”}):
marcel,jonas,maria,jose,amanda
Output {[friend1], source, 1,
{“friend1”, “friend2”, “friend3”}};
{[friend2], source, 1,
{“friend1”, “friend2”, “friend3”}};
'''
input = line.split(';')
user_id, item_ids = input[0], input[1:]
for i in range(len(item_ids)):
f1 = item_ids[i]
yield f1, (user_id, 1, item_ids)
def reverse_index(self, key, values):
'''
Compute the cartesian product between among all
sources that shares common friends.
Input (friend1 -> (source, 1,
{“friend1”, “friend2”, “friend3”}};
marcel, jonas,maria,jose,amanda
Output {(jonas, maria), 1}, {(maria, jose), 1},
{(jonas,amanda), 1}, {(maria, amanda),1},
{(jose, jonas), 1}
'''
final_list = []
for value in values:
final_list.append(value)
for item1, item2 in combinations(final_list, 2):
f1, count1, f1_friends = item1
f2, count2, f2_friends = item2
if f1 not in f2_friends:
yield (f2, f1), (count1)
if f2 not in f1_friends:
yield (f1, f2), (count2)
def count_number_of_friends(self, key, values):
'''
Count the number of mutual friends.
Input {(source, recommended), [count1, count2, ... count n]}
{(jonas, maria), [1,1,1]},
{(jose, jonas), [1,1]},
{(maria, amanda), [1]},
{(maria,jose), [1]},
{(jonas,amanda), [1, 1, 1]}
Output ({friend1, friend2} numberOfMutualFriends):
(jonas, marcel ) 1
(fabiola, marcel) 1
(marcel, patricia) 1
(marcel, paula) 1
(carol, marcel) 2
'''
user_id, possible_friend = key
yield (user_id), \
(possible_friend, sum(values))
def top_recommendations(self, key, values):
'''
Get the TOP N recommendations for user.
Input ({user_id, [(numberOfMutualFriends, friend),
(numberOfMutualFriends2, friend)]}):
[{marcel, [(patricia, 1), (paula, 1)]},
{carol, [(marcel, 2)]},
{fabiola, [(marcel,1)]},
{jonas, [(marcel,1)]}]
Output ({user_id, [(numberOfMutualFriends, friend),
(numberOfMutualFriends2, friend)]}):
Ex: Get the top 3 suggestions.
[{marcel, [(patricia, 1), (paula, 1)]},
{carol, [(marcel, 2)]},
{fabiola, [(marcel,1)]},
{jonas, [(marcel,1)]}]
'''
recommendations = []
for idx, (item, score) in enumerate(values):
recommendations.append((item, score))
yield key, sorted(recommendations, key=lambda k: -k[1])[:TOP_N]
if __name__ == '__main__':
FriendsRecommender.run()
Let's see how I can get some new friends to recommend based on the list of users that posts about #recsys that I created.

Running over twitter and that is:
   
    "@marcelcaraciolo" [["@alansaid"  24],   ["@zennogantner"  21], ["@neal_lathia"  21] ]


Great I didn't know! :D I liked the recommendations. By the way they are great references in recsys.  But let's go further...  Let's check the explanations?  Why those users are recommended to me ?  It is an important feature for improving the acceptance of the given recommendation.

Making some changes at my code,  it gives us the output below:

    "@marcelcaraciolo" [["@alansaid"  24, ["@pankaj", "@recsyschallenge", "@kdnuggets",
    "@mitultiwari", "@sidooms", "@recsys2012", "@khmcnally", "@McWillemsem", "@LensKitRS",
   "@pcastellis",  "@dennisparra", "@filmaster",  "@BamshadMobasher", "@sadrewge",
   "@totopampin",  "@recsyshackday", "@plamere", "@usabart", "@mymedialite', "@reclabs",
   "@elehack","@omdb", "@osbourke", "@siah"]],    ["@zenogantner"  21, ["@sandrewge",
    "@nokiabetalabs", "@foursquareAPI",  "@mitultiwari", "@kiwitobes", "@directededge",
    "@plista", "@twitterapi", "@recsys2012",  "@xamat", "@tlalek", "@namp",
  "@lenskitRS",  "@siah", "@ocelma",  "@abellogin", "@mymedialite', "@totopampin",  "@RecsysWiki","@ScipyTip", "@ogrisel"]], ["@neal_lathia"  21,  ["@googleresearch",
    "@totopampin", "@ushahidi",  "@kaythaney", "@gj_uk", "@hmason",
    "@jure", "@ahousley", "@peteskomoroch",  "@xamat", "@tnhh", "@elizabethmdaly",
  "@recsys2012",  "@sandrewge", "@matboehmer",  "@abellogin", "@pankaj', "@jerepick",  "@alsothings","@edchi", "@zenogantner"]]

Recommendations with explanations and also distributed :D

Atépassar Extension


Ok, I haven't talked yet about Atépassar. How we recommend new friends at Atepassar ?  Atepassar is designed as Twitter, so we also have the entities friends and followers. Besides the common friends we add other attributes in the final score. For instance let's consider now the state where each user lives informed by his user profile at Atepassar. The idea is to recommend people with mutual friends and also weighted by people who lives at the same state.  For the purpose of illustration, let's start with a simple scoring approach by choosing our function to be a linear combination of state and mutual friends similarity. This gives an equation of the form 
                                         fsim(u,i) = w1 f1(u,i) + w2 f2(u,i) + b, 

where u = user, i = the new friend, f1 = the friendship similarity and f2 = the state similarity.  This equation defines a two-dimensional space like the one illustrated below:


This is a sample of the my input:

murilodumps;SC;marcoscampelo
dan_sampaio;RJ
marvinslap;PE;jwalker;marcoscampelo;juliocesarfort;guilhermepaiva;hugofsantiago;bruno;naevio;Mila21Sousa;dansampaio;thaisleao;lucianaamancio;pedro;marciomarques83;x5;narah;pamelaresende;carolcani;itl;roberta_ferreira;sexxyalice;annelouiseadv;bribarbosa;espertinha3;fzanchin;claudiasm;cauecamacho;lucianac;bicalhoferreira;dougui;monnyke;mariaaugusta38;germanabarros;professor1;rosimartome;klauklau;lugentil;rodrigo_miranda;portoalegre;mczmendonca;itsfabio;CarolFollador;ricardofay;lorenameimei;josi_patricia;analaurafonseca;daiana_ugulino;narelle_moraes;kamyu;wallacevidal;falcaoblanco;julianabraggio;tiagoop;giselevix10;natanaelsilva;giovannafc;vivoquinha;alinne_silva_oliveira;amanda_cutrim;gabrielagrisa;bruna_estudando;valquiria_pereira_alves;deniseharue;daysianef;Dayanne_F;italo;Orlando;kauanny;marcelo_;bruna_jacob;adonescunha;fahbiuzinha;Barbabela;fale_rodrigo;michelle_rva;rlsleal12345;creuzamoura;tuliocfp;aeltonf;Cintia_Evelane;hellheize;dhelly;murilodumps;Indianara;thamy_ls;christiane_freire;JulianaGarbim;matheuslino;LMC;Gorrpo;guilhermevilela;gabi;dalekrause;vanessaformigosa;SCANDALL;elaine_regina;rafs_gomes;larylmacedo;erico;spencer;hitalos;daianefarias;rldourado;veronicacordeiro;carmemsmrocha;falcettijr;evertonera;nessa;vtcc;ricardoapjustino;leonardo;lopes21;marcelosantos;Verallucia;paolaseveroo
hugofsantiago;PE;daianefarias
kauanny;PE
Dayanne_F;PE;jwalker
guilhermepaiva;PE;marvinslap;jwalker;veronicacordeiro
italo;PE;marcoscampelo;jwalker;romero_britto;Syl;mariana_mbs
maria;PE
bruno;PE;marcoscampelo;jwalker;marvinslap;rldourado;anagloriaflor;dayannef;flmendes;adm_nathaly2010;Andersonpublicitario;diemesleno;misterobsom;jessica_soares_ribeiro;Orlando;cauecamacho;itl;kk_u;lucianaamancio;josi_patricia;mczmendonca;adonescunha;x5;lugentil;rafaelsantana;katarinebaf;aquista;analaurafonseca;auberio;naevio;mz;flaviooliveira;eduardocruz;robson_ribeiro;gabrielagrisa;fahbiuzinha;nattysilveira;petsabino;eduardovg88;bibc;ninecarvalho10;bjmm;marcossouza;masdesouza;espertinha3;valquiria_pereira_alves;narelle_moraes;rodrigo3n


And the output sample:

"marcelcaraciolo" [["Gileno", 0.44411764705882351], ["raphaelalex", 0.43999999999999995], ["marcossouza", 0.43611111111111112], ["roberta_gama", 0.42352941176470588], ["anagloriaflor", 0.40937499999999999], ["rodrigo3n", 0.40769230769230769], ["alissonpontes", 0.40459770114942528], ["andreza_cristina", 0.40370370370370368], ["naevio", 0.40370370370370368], ["adonescunha", 0.40327868852459015]]


Interesting among the top  10,  5 worked with me at Atepassar (it means lots of common friends).  Let's see the code:

#-*-coding: utf-8 -*-
'''
This module represents the FriendsRecommender system for recommending
new friends based on friendship similarity and state similarity.
'''
__author__ = 'Marcel Caraciolo <caraciol@gmail.com>'
from mrjob.job import MRJob
def combinations(iterable, r):
"""
Implementation of itertools combinations method.
Re-implemented here because of import issues in Amazon Elastic MapReduce.
Was just easier to do this thanbootstrap.
More info here:
http://docs.python.org/library/itertools.html#itertools.combinations
Input/Output:
combinations('ABCD', 2) --> AB AC AD BC BD CD
combinations(range(4), 3) --> 012 013 023 123
"""
pool = tuple(iterable)
n = len(pool)
if r > n:
return
indices = range(r)
yield tuple(pool[i] for i in indices)
while True:
for i in reversed(range(r)):
if indices[i] != i + n - r:
break
else:
return
indices[i] += 1
for j in range(i + 1, r):
indices[j] = indices[j - 1] + 1
yield tuple(pool[i] for i in indices)
TOP_N = 10
class FriendsRecommender(MRJob):
def steps(self):
return [self.mr(self.map_input, self.group_by_keys),
self.mr(None, self.score_similarities),
self.mr(None, self.top_recommendations)]
def map_input(self, key, line):
input = line.split(';')
user_id, state, friend_ids = input[0], input[1], input[2:]
for i in xrange(len(friend_ids)):
f1 = friend_ids[i]
yield ('f', f1), (user_id, 1, state, friend_ids)
if state != 'None':
yield ('s', state), (user_id, 1, state, friend_ids)
def group_by_keys(self, type_and_key, values):
final_list = list(values)
key_type, key = type_and_key
#pass friends through
if key_type == 'f':
for itemA, itemB in combinations(final_list, 2):
user_idA, countA, stateA, f_idsA = itemA
user_idB, countB, stateB, f_idsB = itemB
union = set(f_idsA).union(f_idsB)
if user_idB not in f_idsA:
yield (user_idA, user_idB), ('f', 1.0 / len(union))
if user_idA not in f_idsB:
yield (user_idB, user_idA), ('f', 1.0 / len(union))
return
assert key_type == 's'
for itemA, itemB in combinations(final_list, 2):
user_idA, countA, stateA, f_idsA = itemA
user_idB, countB, stateB, f_idsB = itemB
if user_idB not in f_idsA:
yield (user_idA, user_idB), ('s', 1.0)
if user_idA not in f_idsB:
yield (user_idB, user_idA), ('s', 1.0)
def score_similarities(self, key, values):
user_id, rec_id = key
f_total = 0.0
s_total = 0.0
similarities = (list(values))
for k, score in similarities:
if k == 'f':
f_total += score
else:
s_total += score
f_similarity = (0.7 * f_total) + (0.3 * s_total)
yield user_id, (rec_id, f_similarity)
def top_recommendations(self, key, values):
recommendations = []
for idx, (item, score) in enumerate(values):
recommendations.append((item, score))
yield key, sorted(recommendations, key=lambda k: -k[1])[:TOP_N]
if __name__ == '__main__':
FriendsRecommender.run()

Conclusions

That's a simple algorithm used at Atépassar for recommending friends using some basic graph analysis concepts.  You can extend this code or use it for free at your social network! Go ahead :)   In the next post I will present the map-reduce jobs for course recommendation analysis at PyCursos.

I hope you enjoyed this article,

Best regards,
Marcel Caraciolo
Page 1 of 36123...36 Next Page