Hi all,
In this post I'd like to introduce another approach to recommender engines, one that uses graph concepts to recommend novel and interesting items. I will build a graph-based recommender engine for how-to tutorials using data from the website SnapGuide (by the way, I am a huge fan and user of this tutorials website), the graph database Neo4J and the graph traversal language Gremlin.
Snapguide is a web service for anyone who wants to create and share step-by-step how-to guides. It is available on the web and as an iOS app. There you can find tutorials with easy visual instructions on a wide array of topics including cooking, gardening, crafts, projects, fashion tips and more. It is free, and anyone is invited to submit guides to share their passions and expertise with the community. For research purposes only, I extracted the corpus of tutorial likes from their website. Many users may like the same tutorial, and this signal is quite useful for recommending similar tutorials based on what other users liked. Unfortunately I can't provide the dataset for download, but you can adapt the code below to your own data set.
To create and explore your own graph with Neo4J you would normally use Java or Groovy. I found Bulbflow, an open-source Python ORM for graph databases that supports pluggable backends using the Blueprints standard. In this post I used it to connect to a Neo4j server. The snippet code below is a simple example of Bulbflow in action, creating some edges and vertices.
I defined my graph schema so that the raw data maps into a property graph and the traversals required to recommend tutorials are as natural as possible.
Figure: SnapGuide graph schema
The data will be inserted into the graph database Neo4J. The code below creates a new Neo4J graph with the whole data set.
There are three input files: tutorials.dat, users.dat and likes.dat. The file tutorials.dat contains the list of tutorials; each row has three columns: tutorialId, title and category. The file users.dat contains the list of users; each row has two columns: userId and user name. Finally, likes.dat records the tutorials each user liked; each row has two columns: userId and tutorialId.
Given that there are more than 1 million likes, it will take some time to process all the data. An important note before going on: don't forget to create the vertex indexes. If you forget, your queries will take ages to process.
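To make the file layout concrete, here is a minimal Python sketch that reads the three files into plain dictionaries. It is not the loader used for this post: the tab delimiter, the sample rows, the category names and the helper names (`read_rows`, `build_graph`) are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample rows; the real files and their delimiter are not shown here.
TUTORIALS = (
    "t1\tHow to Make Sous Vide Chicken at Home\tFood\n"
    "t2\tHow to Solve a Rubik Cube 3x3\tGames\n"
)
USERS = "u1\tAmanda\nu2\tEmma\n"
LIKES = "u1\tt1\nu2\tt1\nu2\tt2\n"


def read_rows(text):
    """Parse tab-separated text into a list of rows (assumed delimiter)."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


def build_graph(tutorials_text, users_text, likes_text):
    """Return (tutorials, users, liked) as plain dicts keyed by id."""
    tutorials = {
        tid: {"title": title, "category": category}
        for tid, title, category in read_rows(tutorials_text)
    }
    users = {uid: {"name": name} for uid, name in read_rows(users_text)}
    liked = {uid: set() for uid in users}
    for uid, tid in read_rows(likes_text):
        # Skip likes that point at unknown users or tutorials.
        if uid in users and tid in tutorials:
            liked[uid].add(tid)
    return tutorials, users, liked


tutorials, users, liked = build_graph(TUTORIALS, USERS, LIKES)
```

Each `liked` entry plays the role of the outgoing 'liked' edges of a user vertex in the graph.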
- //These indexes are a must, otherwise querying the graph database will take so looong
- g.createKeyIndex('userId',Vertex.class)
- g.createKeyIndex('tutorialId',Vertex.class)
- g.createKeyIndex('category',Vertex.class)
- g.createKeyIndex('title',Vertex.class)
Before moving on to the recommender algorithms, let's make sure the graph is OK.
For instance, what is the distribution of categories across the tutorials?
- //Distribution frequency of categories
def dist_categories(){
m = [:]
g.V.filter{it.getProperty('type')=='Tutorial'}.out('hasCategory').category.groupCount(m).iterate()
return m.sort{-it.value}
}
What about the average number of likes per tutorial?
- //Get the average number of likes per tutorial
def avg_likes(){
return g.V.filter{it.getProperty('type')=='Tutorial'}.transform{it.in('liked').count()}.mean()
}
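If you want to see what these two sanity checks compute without a running graph database, the following Python sketch does the same bookkeeping on a toy in-memory data set. The data and the shortened titles are hypothetical, and the functions only mirror the idea of the Gremlin queries above, not their exact semantics.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy data: tutorial -> category, and user -> liked tutorials.
category_of = {
    "Sous Vide Chicken": "Food",
    "Chicken Ramen Soup": "Food",
    "Rubik Cube 3x3": "Games",
}
liked = {
    "amanda": {"Sous Vide Chicken", "Chicken Ramen Soup"},
    "emma": {"Sous Vide Chicken"},
}


def dist_categories(category_of):
    """Categories ordered from most to least frequent."""
    return Counter(category_of.values()).most_common()


def avg_likes(category_of, liked):
    """Mean number of likes per tutorial; tutorials nobody liked count as 0."""
    likes = Counter(t for tutorials in liked.values() for t in tutorials)
    return mean(likes[t] for t in category_of)


distribution = dist_categories(category_of)
average = avg_likes(category_of, liked)
```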
Traversing the Tutorials Graph
Now that the data is represented as a graph, let's run some queries. Behind the scenes, these queries are traversals. In recommender systems there are two general types of recommendation approaches: collaborative filtering and content-based filtering.
In collaborative filtering, the liking behavior of users is correlated in order to recommend the favorites of one user to another. In this case, we look for similar users.
I like the tutorials Amanda likes; what other tutorials does Amanda like that I haven't seen?
The content-based strategy, by contrast, relies on the features of a recommendable item: its attributes are analyzed in order to find other items with similar features.
I like food tutorials; what other food tutorials are there?
Making Recommendations
Let's begin with collaborative filtering. I will run some more complex traversal queries on our graph. Let's start with the tutorial "How to Make Sous Vide Chicken at Home". Yes, I love chicken! :)
Great dish, by the way!
Which users liked How to Make Sous Vide Chicken at Home?
- //Get the users who liked a tutorial
def users_liked(tutorial){
v = g.V.filter{it.getProperty('title') == tutorial}
return v.inE('liked').outV.userId[0..4]
}
Which users liked How to Make Sous Vide Chicken at Home, and which other tutorials did they like too?
- //Get the other tutorials liked by the users who liked a tutorial
def similar_tutorials(tutorial){
v = g.V.filter{it.getProperty('title') == tutorial}
return v.inE('liked').outV.outE('liked').inV.title[0..4]
}
What does the query above express?
It finds all users who liked the tutorial (inE('liked').outV), follows their likes to other tutorials (outE('liked').inV) and fetches the titles of those tutorials (title). It returns the first five items ([0..4]).
In recommendations we have to find the items most commonly purchased or liked together. Using Gremlin, we can build a simple collaborative filtering algorithm by chaining several steps together.
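The same two-hop walk can be sketched in plain Python over a dictionary of likes. This is only an illustration under assumed toy data (the user names, the shortened titles and the helper `co_liked_titles` are hypothetical); it emits one title per path, so duplicates are expected.

```python
from itertools import islice

# Hypothetical toy data: user -> tutorials they liked.
liked = {
    "amanda": ["Sous Vide Chicken", "Chicken Ramen Soup"],
    "emma": ["Sous Vide Chicken", "Mint Juleps"],
    "bob": ["Rubik Cube 3x3"],
}


def co_liked_titles(tutorial, liked):
    """Yield one title per path: tutorial <- user -> other tutorial."""
    for user, titles in liked.items():
        if tutorial in titles:        # roughly inE('liked').outV
            for title in titles:      # roughly outE('liked').inV.title
                yield title


# Keep only the first five results, like [0..4] in the Gremlin query.
first_five = list(islice(co_liked_titles("Sous Vide Chicken", liked), 5))
```

Note that the starting tutorial shows up in its own results, once for every user who liked it.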
- //Get similar tutorials
def topMatches(tutorial){
m = [:]
v = g.V.filter{it.getProperty('title') == tutorial}
v.inE('liked').outV.outE('liked').inV.title.groupCount(m).iterate()
return m.sort{-it.value}[0..9]
}
This traversal returns a list of tutorials. But if you fetch all the matches, you may notice that there are many duplicates. This happens because the people who like How to Make Sous Vide Chicken at Home also like many of the same other tutorials. This overlap between users is exactly the similarity that collaborative filtering algorithms exploit.
How many of the tutorials highly correlated with How to Make Sous Vide Chicken at Home are unique?
- //Get the number of unique similar tutorials
def n_similar_unique_tutorials(tutorial){
v = g.V.filter{it.title == tutorial}
return v.inE('liked').outV.outE('liked').inV.dedup.count()
}
//Get the number of similar tutorials
def n_similar_tutorials(tutorial){
v = g.V.filter{it.getProperty('title') == tutorial}
return v.inE('liked').outV.outE('liked').inV.count()
}
There are 37323 paths from How to Make Sous Vide Chicken at Home to other tutorials, and only 8766 of those tutorials are unique. We can use these duplications to build a ranking mechanism for recommendations: the more paths lead to a tutorial, the more strongly it is co-liked.
Which tutorials are most highly co-rated with How to Make Sous Vide Chicken at Home?
So we have the top similar tutorials. It means that people who like How to Make Sous Vide Chicken at Home also like How to Make Sous Vide Chicken at Home. Oops! Let's remove these reflexive paths by filtering out the Sous Vide Chicken tutorial itself.
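A small Python sketch shows how the number of paths, the number of unique tutorials and a ranking all fall out of one counting pass. The toy data and the helper name `co_like_counts` are hypothetical; a `Counter` stands in for Gremlin's groupCount.

```python
from collections import Counter

# Hypothetical toy data: user -> tutorials they liked.
liked = {
    "amanda": ["Sous Vide Chicken", "Chicken Ramen Soup", "Mint Juleps"],
    "emma": ["Sous Vide Chicken", "Chicken Ramen Soup"],
    "bob": ["Rubik Cube 3x3"],
}


def co_like_counts(tutorial, liked):
    """Count how many paths lead from `tutorial` to each co-liked tutorial."""
    counts = Counter()
    for titles in liked.values():
        if tutorial in titles:
            counts.update(titles)
    return counts


counts = co_like_counts("Sous Vide Chicken", liked)
n_paths = sum(counts.values())   # every path, duplicates included
n_unique = len(counts)           # distinct tutorials reached
ranking = counts.most_common()   # most frequently co-liked first
```

In this toy data there are five paths but only three distinct tutorials, and the duplicate count is what orders the ranking.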
The recommendation above starts from a particular tutorial (i.e. Make Sous Vide Chicken), not from a particular user. This collaborative filtering method is called item-based filtering.
- //Get similar tutorials
def topUniqueMatches(tutorial){
m = [:]
v = g.V.filter{it.getProperty('title') == tutorial}
possible_tutorials = v.inE('liked').outV.outE('liked').inV
possible_tutorials.hasNot('title',tutorial).title.groupCount(m).iterate()
return m.sort{-it.value}[0..9]
}
Given a tutorial that a user likes, who else likes this tutorial, and, among those users, what other tutorials do they like that the initial user has not already liked?
And what about recommendations for a particular user? That is where user-based filtering comes in.
Which tutorials liked by similar users should be recommended to a given user?
- def userRecommendations(user){
m = [:]
x = [] //tutorials the user already liked, filled by aggregate(x)
v = g.V.filter{it.getProperty('userId') == user}
v.out('liked').aggregate(x).in('liked').dedup.out('liked').except(x).title.groupCount(m).iterate()
return m.sort{-it.value}[0..9]
}
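For readers who prefer to see the user-based idea outside Gremlin, here is a minimal Python sketch on hypothetical toy data. The function name `user_recommendations` and the sample users are assumptions; the logic follows the same three moves: collect what the user already liked, find the other users who share at least one like, and count what those users liked that the original user has not.

```python
from collections import Counter

# Hypothetical toy data: user -> set of liked tutorials.
liked = {
    "emma": {"Sous Vide Chicken"},
    "amanda": {"Sous Vide Chicken", "Chicken Ramen Soup", "Mint Juleps"},
    "carl": {"Sous Vide Chicken", "Chicken Ramen Soup"},
    "bob": {"Rubik Cube 3x3"},
}


def user_recommendations(user, liked, top_n=10):
    """Rank tutorials liked by similar users that `user` has not liked yet."""
    seen = liked[user]  # plays the role of aggregate(x)
    # Each similar user is considered once, like dedup in the traversal.
    neighbours = {
        other for other, tutorials in liked.items()
        if other != user and tutorials & seen
    }
    counts = Counter()
    for other in neighbours:
        counts.update(liked[other] - seen)  # plays the role of except(x)
    return counts.most_common(top_n)


recommendations = user_recommendations("emma", liked)
```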
Emma Rushin will really like art and crafts suggestions! :D
OK, we have interesting recommendations, but if I want to cook another chicken dish for dinner, such as Chicken Ramen Soup, I probably don't want a tutorial on How to Solve a Rubik Cube 3x3. To handle this situation, we can mix collaborative filtering and content-based recommendation in a single traversal, so that it recommends similar chicken and food tutorials based on shared categories and keywords.
Now let's play with content-based recommendation!
Which tutorials are most highly correlated with Sous Vide Chicken that share the same category of food?
- //Top recommendations mixing content + collaborative sharing all categories.
def topRecommendations(tutorial){
m = [:]
x = [] as Set
v = g.V.filter{it.getProperty('title') == tutorial}
tuts =v.out('hasCategory').aggregate(x).back(2).inE('liked').outV.outE('liked').inV
tuts.hasNot('title',tutorial).out('hasCategory').retain(x).back(2).title.groupCount(m).iterate()
return m.sort{-it.value}[0..9]
}
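The mixed strategy can also be sketched in plain Python: count co-likes as before, but keep only the tutorials that share at least one category with the starting tutorial. The category assignments, user names and the function name `top_recommendations` are hypothetical. Unlike the traversal above, this sketch counts each path once even when several categories are shared, which is a deliberate simplification.

```python
from collections import Counter

# Hypothetical toy data: tutorial -> categories, and user -> liked tutorials.
categories = {
    "Sous Vide Chicken": {"Food"},
    "Chicken Ramen Soup": {"Food"},
    "Rubik Cube 3x3": {"Games"},
}
liked = {
    "amanda": {"Sous Vide Chicken", "Chicken Ramen Soup", "Rubik Cube 3x3"},
    "emma": {"Sous Vide Chicken", "Chicken Ramen Soup"},
}


def top_recommendations(tutorial, liked, categories, top_n=10):
    """Rank co-liked tutorials that share a category with `tutorial`."""
    wanted = categories[tutorial]
    counts = Counter()
    for tutorials in liked.values():
        if tutorial not in tutorials:
            continue
        for other in tutorials:
            # Drop the starting tutorial and anything without a shared category.
            if other != tutorial and categories[other] & wanted:
                counts[other] += 1
    return counts.most_common(top_n)


recommendations = top_recommendations("Sous Vide Chicken", liked, categories)
```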
This ranking makes sense, but it still has a flaw. A tutorial like Make Mint Juleps may not be interesting to me. How about considering only those tutorials that share the keyword 'chicken' with Sous Vide Chicken?
Which tutorials are most highly co-rated with Sous Vide Chicken and share the keyword 'chicken' with it?
- //Top recommendations mixing content + collaborative sharing the chicken category.
def topRecommendations(tutorial){
m = [:]
v = g.V.filter{it.getProperty('title') == tutorial}
v.inE('liked').outV.outE('liked').inV.hasNot('title',tutorial).out('hasCategory').
has('category','chicken').back(2).title.groupCount(m).iterate()
return m.sort{-it.value}[0..9]
}
Conclusions
In this post I presented one strategy for recommending items using graph concepts. What I explored here is the flexibility of the property graph data structure and the notion of derived and inferred relationships. This strategy could be extended to use other features available in your dataset (I am sure that SnapGuide has richer information to use, such as age, sex and the category taxonomy). I am working on a book on recommender systems, where I will explain graph-based recommendations in more detail, so stay tuned to my blog!
And the performance? I haven't run tests to compare it with current solutions. What I can say is that Neo4J can theoretically hold billions of entities (vertices + edges) and that Gremlin makes advanced queries possible. I will perform some tests, but based on what I have studied, runtimes vary with the complexity of the graph structure.
I would also like to thank Marko Rodriguez for his help in the Gremlin-Users community and for his post, which inspired me to take a closer look at Neo4J + Gremlin! It amazed me! :)
Marcel Caraciolo