Analyse users' browsing context to build up a web recommender
#############################################################

:date: 2011-04-01
:tags: recommendations, browsers, users

No, this is not an April Fools' joke ;)

Wow, it's been a long time. My year in Oxford is going really well. I realized
a few days ago that the end of the year is approaching really quickly.
Exams are coming in a month or so, and then I'll be working full time on my
dissertation topic.

When I learned we would have about six months to work on something, I first
thought about doing something packaging-related, but finally decided to start
something new. After all, it's a good time to learn.

I have long been impressed by the `last.fm <http://last.fm>`_
recommender system. They have been *scrobbling* the music I listen to for
something like 5 years now, and the recommendations they make are really nice
and accurate (I discover **a lot** of new artists that match my tastes by
listening to the "neighbour radio"). By the way, `here is
<http://lastfm.com/user/akounet/>`_ my last.fm account.

So I decided to work on recommender systems, to better understand what they
are about.

Recommender systems are usually used to increase the sales of products
(like Amazon.com does), which is not really what I'm looking for (those who
know me a bit know I'm kind of sick of all this consumerism going on).

Actually, the simplest thing I thought of was the web: I browse it almost
every day, and new content appears each time. I've stopped following `my feed
reader <https://bitbucket.org/bruno/aspirator/>`_ because of the
information overload, and drastically reduced the number of people I follow `on
twitter <http://twitter.com/ametaireau/>`_.

Too much information kills information.

You have probably guessed what my dissertation topic will be: a recommender
system for the web. Well, such recommender systems already exist, so I will try
to add contextual information to them: you're probably not interested in the
same topics at different times of the day, or depending on the computer you're
using. We can also probably make good use of the way you browse to group the
content you're browsing (or even use the great Firefox 4 tab groups feature).

There are also a lot of concerns to address about users' privacy.

Here is my proposal (copy/pasted from the one I had to write for my master's):

Introduction and rationale
==========================

Nowadays, people surf the web more and more often. New web pages are created
each day, so the amount of information to sift through grows as time
passes. Users browse the web in different contexts, from finding cooking
recipes to reading technical articles.

A lot of people share the same interest in various topics, and the quantity of
information is such that it's really hard to triage it efficiently without
spending hours doing so: firstly because of the huge quantity of information,
but also because triage is relative to each person. However, this
triage can be made easier by collecting the browsing information of
individual users and putting it in perspective.

Machine learning is a branch of Artificial Intelligence (AI) which deals with how
a program can learn from data. Recommender systems are a particular
application area of machine learning: they recommend things (links
in our case) to users, given a database containing the previous
choices users have made.

This browsing information is currently available in browsers. Even if it is not
in a very usable format, it is possible to transform it into something useful.
This information gold mine is just waiting to be used. However, it is not as
simple as it may seem at first: it is important to take into account the context
the user is in while browsing links. For instance, it's more likely that during
the day a computer scientist will browse computing-related links, and that during
the evening he browses cooking recipes or something else.

Page contents are also interesting to analyse, because that's what people
browse and what actually contains the most interesting part of the information.
The raw data extracted from the browsing can then be translated into
something more useful (namely tags, type of resource, visit frequency,
navigation context, etc.).

The goal of this dissertation is to create a recommender system for web links
that includes this context information.

At the end of the dissertation, different pieces of software will be provided,
from raw data collection in the browser to a recommender system.

Background Review
=================

This dissertation is mainly about data extraction, analysis and recommender
systems. Two different research areas can be isolated: data preprocessing and
information filtering.

The first step in making recommendations is to gather some data. The
more data we have available, the better (T. Segaran, 2007). This data can
be retrieved in various ways; one of them is to get it directly from users'
browsers.

Data preparation and extraction
-------------------------------

The data gathered from browsers is basically URLs and additional information
about the context of the navigation. There is clearly a need to extract more
information about the meaning of the data the user is browsing, starting with the
content of the web pages.

Because the information provided on the current Web is not meant to be read by
machines (T. Berners-Lee, 2001), tools are needed to extract meaning from
web pages. The information needs to be preprocessed before being stored in a
machine-readable format that allows recommendations to be made (Choochart et
al., 2004).

Data preparation is composed of two steps: cleaning and structuring
(Castellano et al., 2007). Raw data can contain a lot of unneeded text
(such as menus, headers, etc.) and needs to be cleaned before being stored.
Multiple techniques can be used here; they belong to boilerplate removal and
full-text extraction (Kohlschütter et al., 2010).

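
To make the cleaning step more concrete, here is a minimal sketch using only
Python's standard library. It is a deliberately naive tag-based filter, not the
shallow text features technique cited above: the list of skipped tags and the
names ``MainTextExtractor`` and ``extract_text`` are assumptions made for this
illustration.

```python
from html.parser import HTMLParser

# Tags whose content is treated as boilerplate (an assumption for this sketch).
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    """Collect the text that is not nested inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # how many skipped tags are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the page text with menus, headers and footers dropped."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><body><nav>Home | About</nav>"
    "<p>Real <b>content</b> here.</p>"
    "<footer>Copyright</footer></body></html>"
)
print(extract_text(page))  # prints "Real content here."
```

Real pages rarely mark their boilerplate this cleanly, which is why the
literature relies on text features rather than on tag names alone.
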
Then comes structuring the information: the category and type of content (news,
blog, wiki) can be extracted from raw data. This kind of information is not
clearly defined by HTML pages, so tools are needed to recognise it.

Some context-related information can also be inferred for each resource. It can
range from the visit frequency to the navigation group the user was in while
browsing. It is also possible to determine whether the user "liked" a resource
and to derive a mark for it, which can be used by information filtering at a
later step (T. Segaran, 2007).

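
As an illustration of how a mark could be derived from visit frequency, here is
a small sketch. The logarithmic scaling, the 0 to 5 range and the function name
``implicit_marks`` are assumptions chosen for the example; the proposal does not
fix any particular formula.

```python
import math
from collections import Counter


def implicit_marks(visited_urls, max_mark=5.0):
    """Turn a list of visited URLs into one mark per URL.

    The mark grows with the logarithm of the visit count and is scaled so
    that the most visited URL gets ``max_mark`` (an arbitrary choice).
    """
    counts = Counter(visited_urls)
    if not counts:
        return {}
    top = math.log1p(max(counts.values()))
    return {
        url: round(max_mark * math.log1p(count) / top, 2)
        for url, count in counts.items()
    }


history = ["a.example", "a.example", "a.example", "b.example"]
marks = implicit_marks(history)
```

Using a logarithm keeps a page visited a hundred times from completely dwarfing
a page visited ten times, but any monotonic function would do for a first
experiment.
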
At this stage, structuring the data is required. Storing this kind of
information in an RDBMS can be a bit tedious and require complex queries to get
the data back in a usable format. Graph databases can play a major role in
simplifying information storage and querying.

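
To give an idea of why a graph model fits this data, here is a tiny in-memory
stand-in written in plain Python. It is not the API of any real graph database:
the ``BrowsingGraph`` class, the edge labels and the example URLs are all
hypothetical.

```python
from collections import defaultdict


class BrowsingGraph:
    """Tiny in-memory graph: nodes are strings, edges carry a label."""

    def __init__(self):
        # node -> set of (label, neighbour) pairs
        self._edges = defaultdict(set)

    def add_edge(self, source, label, target):
        self._edges[source].add((label, target))
        self._edges[target].add((label, source))  # keep the graph undirected

    def neighbours(self, node, label=None):
        """Return the sorted neighbours of ``node``, optionally by edge label."""
        return sorted(
            other
            for edge_label, other in self._edges[node]
            if label is None or edge_label == label
        )


graph = BrowsingGraph()
graph.add_edge("http://example.org/recipe", "visited_in", "context:evening")
graph.add_edge("http://example.org/python", "visited_in", "context:workday")
graph.add_edge("http://example.org/recipe", "tagged", "tag:cooking")
```

A question such as "which pages were visited in the evening?" then becomes a
single neighbour lookup, ``graph.neighbours("context:evening")``, instead of a
join across several tables.
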
Information filtering
---------------------

To filter the information, three techniques can be used (Balabanović et
al., 1997):

* The content-based approach states that if a user has liked something in the
  past, he is more likely to like similar things in the future. So it's about
  establishing a profile for the user and comparing new items against it.
* The collaborative approach instead recommends items that other, similar users
  have liked. This approach considers only the relationships between users, and
  not the profile of the user we are making recommendations to.
* The hybrid approach, which appeared more recently, combines both of the
  previous approaches, giving recommendations when items score highly against
  the user's profile, or when a similar user already liked them.

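
The collaborative approach described above can be sketched in a few lines. This
is a minimal user-to-user example under stated assumptions: marks are stored as
``{user: {link: mark}}`` dictionaries, similarity is measured with the cosine
of the mark vectors, and the names ``cosine_similarity`` and ``recommend`` are
made up for the illustration.

```python
import math


def cosine_similarity(marks_a, marks_b):
    """Cosine similarity between two {item: mark} dictionaries."""
    shared = set(marks_a) & set(marks_b)
    dot = sum(marks_a[item] * marks_b[item] for item in shared)
    norm_a = math.sqrt(sum(mark ** 2 for mark in marks_a.values()))
    norm_b = math.sqrt(sum(mark ** 2 for mark in marks_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(user, all_marks, limit=3):
    """Rank links the user has not seen by similarity-weighted marks."""
    scores = {}
    for other, marks in all_marks.items():
        if other == user:
            continue
        similarity = cosine_similarity(all_marks[user], marks)
        for item, mark in marks.items():
            if item not in all_marks[user]:
                scores[item] = scores.get(item, 0.0) + similarity * mark
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:limit]


all_marks = {
    "alice": {"python-docs": 5, "linux-news": 4},
    "bob": {"python-docs": 5, "linux-news": 3, "packaging-blog": 4},
    "carol": {"recipes": 5, "python-docs": 1},
}
suggestions = recommend("alice", all_marks)
```

Here ``bob`` browses much like ``alice``, so the link only he has visited ranks
first, while the link from the less similar ``carol`` comes second. A hybrid
system would additionally compare each candidate link with the user's own
profile.
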
Grouping is also something to consider at this stage (G. Myatt, 2007).
Because we are dealing with a huge amount of data, it can be useful to detect
groups of data that fit together. Data clustering is able to find such groups
(T. Segaran, 2007).

References:

* Balabanović, M., & Shoham, Y. (1997). Fab: content-based, collaborative
  recommendation. Communications of the ACM, 40(3), 66–72. ACM.
  Retrieved March 1, 2011, from http://portal.acm.org/citation.cfm?id=245108.245124&.
* Berners-Lee, T., Hendler, J., & Lassila, O. (2001).
  The semantic web. Scientific American, 284(5), 34–43.
  Retrieved November 21, 2010, from http://www.citeulike.org/group/222/article/1176986.
* Castellano, G., Fanelli, A., & Torsello, M. (2007).
  LODAP: a LOg DAta Preprocessor for mining Web browsing patterns.
  Proceedings of the 6th WSEAS Int. Conf. on Artificial Intelligence, Knowledge
  Engineering and Data Bases, Volume 6 (pp. 12–17). World Scientific and
  Engineering Academy and Society (WSEAS).
  Retrieved March 8, 2011, from http://portal.acm.org/citation.cfm?id=1348485.1348488.
* Kohlschütter, C., Fankhauser, P., & Nejdl, W. (2010). Boilerplate detection
  using shallow text features. Proceedings of the third ACM international
  conference on Web search and data mining (pp. 441–450). ACM.
  Retrieved March 8, 2011, from http://portal.acm.org/citation.cfm?id=1718542.
* Myatt, G. J. (2007). Making Sense of Data: A Practical Guide to Exploratory
  Data Analysis and Data Mining.
* Segaran, T. (2007). Collective Intelligence.

Privacy
=======

The first thing that comes to people's minds when it comes to processing their
browsing data is privacy. People don't want to be stalked. That's perfectly
fair, and I don't either.

But such a system doesn't have to deal with people's identities. It's entirely
possible to process completely anonymous data, and that's probably what I'm
going to do.

By the way, if you have interesting thoughts about this, or if you know of
projects that seem related, fire away in the comments!

What's the plan?
================

There are a lot of different things to explore, especially because I'm
a complete novice in this field.

* I want to develop a Firefox plugin to extract the browsing information
  (I still need to work out exactly which kind of information to retrieve). The
  idea is to provide some *raw* browsing data, and then to transform it and
  store it in the best possible way.
* Analyse how to store the information in a graph database. What are the
  different methods to store this data and to visualize the relationships
  between different pieces of data? How can I define the different contexts,
  and add that information to the database?
* Process the data using well-known recommendation algorithms. Compare the
  results and critically assess their value.

There is plenty of stuff I want to try during this experiment:

* I want to try using Gephi to visualize the connections between the links
  and the contexts.
* Try using graph databases such as Neo4j.
* Take a deeper look at tools such as scikit-learn (a machine learning
  toolkit in Python).
* Analyse web pages in order to categorize them, and process their
  contents as well, to do some keyword-based classification.

Lots of work on its way, yay!