
<!DOCTYPE html>
<html lang="en">

<head>
    <title>Analyse users&#8217; browsing context to build up a web&nbsp;recommender - Alexis Métaireau</title>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/main.css" type="text/css" />
    <link href="https://blog.notmyidea.org/feeds/all.atom.xml" type="application/atom+xml" rel="alternate"
        title="Alexis Métaireau ATOM Feed" />
</head>

<body>

<header>
	<h1 class="post-title">Analyse users&#8217; browsing context to build up a web&nbsp;recommender</h1>
	<time datetime="2011-04-01T00:00:00+02:00">1 April 2011</time>


</header>
<article>

<p>No, this is not an April Fools&#8217; joke&nbsp;;)</p>
<p>Wow, it&#8217;s been a long time. My year in Oxford is going really well. I
realized a few days ago that the end of the year is approaching really
quickly. Exams are coming in a month or so, and then I&#8217;ll be working
full time on my dissertation&nbsp;topic.</p>
<p>When I learned we would have about six months to work on something, I first
thought about doing something packaging-related, but finally decided to
start something new. After all, this is a good time to&nbsp;learn.</p>
<p>For a long time, I&#8217;ve been impressed by the <a href="http://last.fm">last.fm</a>
recommender system. They have been <em>scrobbling</em> the music I listen to for
something like 5 years now, and their recommendations are
really nice and accurate (I discovered <strong>a lot</strong> of great artists
listening to the &#8220;neighbour radio&#8221;). By the way, <a href="http://lastfm.com/user/akounet/">here
is</a> my last.fm&nbsp;account.</p>
<p>So I decided to work on recommender systems, to better understand what
they are&nbsp;about.</p>
<p>Recommender systems are usually used to increase the sales of products
(like Amazon.com does), which is not really what I&#8217;m looking for (those
who know me a bit know I&#8217;m kind of sick of all this consumerism going&nbsp;on).</p>
<p>Actually, the simplest thing I thought of was the web: I browse
it almost every day, and each time new content appears. I&#8217;ve stopped
following <a href="https://bitbucket.org/bruno/aspirator/">my feed reader</a> because
of the information overload, and drastically reduced the number of
people I follow <a href="http://twitter.com/ametaireau/">on twitter</a>.</p>
<p>Too much information kills the&nbsp;information.</p>
<p>You have probably guessed what my dissertation topic will be: a recommender system
for the web. Well, such recommender systems already exist, so I will
try to add contextual information to them: you&#8217;re probably not
interested in the same topics at different times of the day, or
depending on the computer you&#8217;re using. We can also probably make good
use of the way you browse to create groups within the content you&#8217;re
browsing (or even use the great Firefox 4 tab group&nbsp;feature).</p>
<p>There are also a lot of concerns to address about users&#8217; privacy.</p>
<p>Here is my proposal (copy/pasted from the one I had to write for my&nbsp;master&#8217;s):</p>
<h2 id="introduction-and-rationale">Introduction and&nbsp;rationale</h2>
<p>Nowadays, people surf the web more and more often. New web pages are
created each day, so the amount of information to retrieve grows
as time passes. Users use the web in different
contexts, from finding cooking recipes to reading technical&nbsp;articles.</p>
<p>A lot of people share the same interest in various topics, and the
quantity of information is such that it&#8217;s really hard to triage it
efficiently without spending hours doing so: firstly because of the huge
quantity of information, but also because the triage is
relative to each person. However, this triage can be facilitated by
fetching the browsing information of individual users and putting
it in&nbsp;perspective.</p>
<p>Machine learning is a branch of Artificial Intelligence (<span class="caps">AI</span>) which deals
with how a program can learn from data. Recommendation systems are a
particular application area of machine learning: they are able to
recommend things (links in our case) to users, given a
database containing the previous choices users have&nbsp;made.</p>
<p>This browsing information is currently available in browsers. Even if it
is not in a very usable format, it is possible to transform it into
something useful. This information gold mine is just waiting to be used.
However, it is not as simple as it may seem at first: it
is important to take into account the context the user is in while browsing
links. For instance, it&#8217;s more likely that during the day, a computer
scientist will browse computing-related links, and that during the
evening, the same person will browse cooking recipes or something&nbsp;else.</p>
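<p>As a rough illustration of what a time-based context could look like, here is a minimal Python sketch. The three labels and the 9:00 to 18:00 boundary are assumptions made up for the example, not part of the proposal; a real system would more likely learn such boundaries from the data.</p>

```python
from collections import defaultdict
from datetime import datetime


def time_context(visited_at):
    """Map a visit time to a coarse, hypothetical context label."""
    if visited_at.weekday() >= 5:  # Saturday or Sunday
        return "weekend"
    if visited_at.hour in range(9, 18):  # assumed working hours
        return "workday"
    return "off-hours"


def group_by_context(visits):
    """Group (url, datetime) pairs by the context they were visited in."""
    groups = defaultdict(list)
    for url, visited_at in visits:
        groups[time_context(visited_at)].append(url)
    return dict(groups)
```

<p>Grouping the history this way would let a recommender build one profile per context instead of a single profile per user.</p>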
<p>Page contents are also interesting to analyse, because they are what
people browse and what actually contains the most interesting part of the
information. The raw data extracted from the browsing can then be
translated into something more useful (namely tags, type of resource,
visit frequency, navigation context,&nbsp;etc.).</p>
<p>The goal of this dissertation is to create a recommender system for web
links, including this context&nbsp;information.</p>
<p>At the end of the dissertation, different pieces of software will be
provided, from raw data collection from the browser to a recommendation&nbsp;system.</p>
<h2 id="background-review">Background&nbsp;Review</h2>
<p>This dissertation is mainly about data extraction, analysis and
recommendation systems. Two different research areas can be identified:
data preprocessing and information&nbsp;filtering.</p>
<p>The first step in making recommendations is to gather some data.
The more data we have available, the better (T. Segaran, 2007).
This data can be retrieved in various ways; one of them is to get it
directly from users&#8217;&nbsp;browsers.</p>
<h3 id="data-preparation-and-extraction">Data preparation and&nbsp;extraction</h3>
<p>The data gathered from browsers basically consists of URLs and additional
information about the context of the navigation. There is clearly a need
to extract more information about the meaning of the data the user is
browsing, starting with the content of the web&nbsp;pages.</p>
<p>Because the information provided on the current Web is not meant to be
read by machines (T. Berners-Lee, 2001), there is a need for tools to
extract meaning from web pages. The information needs to be preprocessed
before being stored in a machine-readable format that makes
recommendations possible (Choochart et al.,&nbsp;2004).</p>
<p>Data preparation is composed of two steps: cleaning and structuring
(Castellano et al., 2007). Raw data can contain a lot of unneeded
text (such as menus, headers, etc.) and needs to be cleaned before being
stored. Multiple techniques can be used here; they belong to boilerplate
removal and full text extraction (Kohlschütter et al.,&nbsp;2010).</p>
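<p>To make the cleaning step concrete, here is a deliberately naive Python sketch that only uses the standard library. It drops the text found inside a fixed set of tags and keeps the rest. This is an assumption-laden simplification, not the shallow text feature method of Kohlschütter et al.; the list of skipped tags is hypothetical.</p>

```python
from html.parser import HTMLParser

# Tags whose content is treated as boilerplate (an assumption for this sketch).
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside one of the skipped tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # how many skipped tags are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of a page with the skipped tags removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

<p>Calling <code>extract_text</code> on a page source returns one string with menus, headers and footers left out, which can then be stored or tokenised.</p>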
<p>Then comes structuring the information: the category and type of content (news,
blog, wiki) can be extracted from raw data. This kind of information is
not clearly defined by <span class="caps">HTML</span> pages, so there is a need for tools to
recognise&nbsp;it.</p>
<p>Some context-related information can also be inferred for each
resource. It can range from the visit frequency to the navigation group the
user was in while browsing. It is also possible to determine whether the user
&#8220;liked&#8221; a resource, and to derive a rating for it, which can be used by
the information filtering step later on (T. Segaran,&nbsp;2007).</p>
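<p>One possible way to derive such a rating from visit frequency is sketched below in Python. The log scaling and the normalisation by the most visited page are assumptions chosen for the example; other signals (time spent on the page, bookmarks) could be combined in the same way.</p>

```python
import math


def implicit_ratings(visit_counts):
    """Turn {url: number_of_visits} into {url: score}, with scores in (0, 1].

    Counts are log-scaled so that a page visited 100 times does not
    completely dwarf a page visited 10 times. Pages with no visit are dropped.
    """
    visited = {url: n for url, n in visit_counts.items() if n > 0}
    if not visited:
        return {}
    top = math.log1p(max(visited.values()))
    return {url: math.log1p(n) / top for url, n in visited.items()}
```

<p>The most visited page always gets a score of 1.0, and every other page gets a score relative to it.</p>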
<p>At this stage, structuring the data is required. Storing this kind of
information in an <span class="caps">RDBMS</span> can be a bit tedious and requires complex queries to
get the data back in a usable format. Graph databases can play a major
role in simplifying information storage and&nbsp;querying.</p>
<h3 id="information-filtering">Information&nbsp;filtering</h3>
<p>To filter the information, three techniques can be used (Balabanović and
Shoham,&nbsp;1997):</p>
<ul>
<li>The content-based approach states that if a user has liked
    something in the past, they are more likely to like similar things in
    the future. So it&#8217;s about establishing a profile for the user and
    comparing new items against&nbsp;it.</li>
<li>The collaborative approach instead recommends items that other
    similar users have liked. This approach considers only the
    relationships between users, and not the profile of the user we are
    making recommendations&nbsp;to.</li>
<li>The hybrid approach, which appeared more recently, combines both of the
    previous approaches, giving recommendations when items score highly
    against the user&#8217;s profile, or when a similar user already liked&nbsp;them.</li>
</ul>
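<p>The collaborative approach can be sketched in a few lines of Python. This is a minimal user-based version, assuming each user is described by a dictionary of non-negative link scores and that cosine similarity is an acceptable measure of how alike two users are; both choices are assumptions of the example, and all names are hypothetical.</p>

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse rating dicts {url: score}."""
    shared = set(a).intersection(b)
    dot = sum(a[url] * b[url] for url in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(target, ratings, top_n=3):
    """Rank links the target user has not visited, weighted by user similarity."""
    seen = ratings[target]
    scores = {}
    for other, their_ratings in ratings.items():
        if other == target:
            continue
        similarity = cosine(seen, their_ratings)
        if similarity == 0:  # nothing in common, ignore this user
            continue
        for url, score in their_ratings.items():
            if url not in seen:
                scores[url] = scores.get(url, 0.0) + similarity * score
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]
```

<p>A hybrid system could add a content-based score for each candidate link to the value accumulated in <code>scores</code> before ranking.</p>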
<p>Grouping is also something to consider at this stage (G. Myatt, 2007).
Because we are dealing with a huge amount of data, it can be useful to
detect groups of data that fit together. Data clustering is able to
find such groups (T. Segaran,&nbsp;2007).</p>
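<p>As an example of data clustering, here is a naive k-means sketch in Python. It assumes each page has already been turned into a tuple of numbers (a hypothetical feature vector), and it seeds the centroids with the first k points to stay deterministic, which a real implementation would not do.</p>

```python
def kmeans(points, k, iterations=10):
    """Naive k-means on tuples of floats; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [
                sum((x - c) ** 2 for x, c in zip(point, centroid))
                for centroid in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return clusters
```

<p>The function returns one list of points per cluster; an empty cluster simply keeps its previous centroid.</p>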
<p>References:</p>
<ul>
<li>Balabanović, M., <span class="amp">&amp;</span> Shoham, Y. (1997). Fab: content-based,
    collaborative recommendation. Communications of the <span class="caps">ACM</span>, 40(3),
    66–72. <span class="caps">ACM</span>. Retrieved March 1, 2011, from
    <a href="http://portal.acm.org/citation.cfm?id=245108.245124&amp;">http://portal.acm.org/citation.cfm?id=245108.245124&amp;</a>.</li>
<li>Berners-Lee, T., Hendler, J., <span class="amp">&amp;</span> Lassila, O. (2001). The Semantic
    Web. Scientific American, 284(5), 34–43.
    Retrieved November 21, 2010, from
    <a href="http://www.citeulike.org/group/222/article/1176986">http://www.citeulike.org/group/222/article/1176986</a>.</li>
<li>Castellano, G., Fanelli, A., <span class="amp">&amp;</span> Torsello, M. (2007). <span class="caps">LODAP</span>: a LOg
    DAta Preprocessor for mining Web browsing patterns. Proceedings of
    the 6th <span class="caps">WSEAS</span> International Conference on Artificial
    Intelligence, Knowledge Engineering and Data Bases, Volume 6 (pp.
    12–17). World Scientific and Engineering Academy and Society
    (<span class="caps">WSEAS</span>). Retrieved March 8, 2011, from
    <a href="http://portal.acm.org/citation.cfm?id=1348485.1348488">http://portal.acm.org/citation.cfm?id=1348485.1348488</a>.</li>
<li>Kohlschütter, C., Fankhauser, P., <span class="amp">&amp;</span> Nejdl, W. (2010). Boilerplate
    detection using shallow text features. Proceedings of the third <span class="caps">ACM</span>
    international conference on Web search and data mining (pp. 441–450).
    <span class="caps">ACM</span>. Retrieved March 8, 2011, from
    <a href="http://portal.acm.org/citation.cfm?id=1718542">http://portal.acm.org/citation.cfm?id=1718542</a>.</li>
<li>Myatt, <span class="caps">G. J.</span> (2007). Making Sense of Data: A Practical Guide to
    Exploratory Data Analysis and Data&nbsp;Mining.</li>
<li>Segaran, T. (2007). Programming Collective&nbsp;Intelligence.</li>
</ul>
<h2 id="privacy">Privacy</h2>
<p>The first thing that comes to people&#8217;s minds when it comes to processing
their browsing data is privacy. People don&#8217;t want to be stalked. That&#8217;s
perfectly understandable, and I don&#8217;t&nbsp;either.</p>
<p>But such a system doesn&#8217;t have to deal with people&#8217;s identities. It&#8217;s
entirely possible to process completely anonymous data, and that&#8217;s
probably what I&#8217;m gonna&nbsp;do.</p>
<p>By the way, if you have interesting thoughts about that, or if you know
of projects that seem related, fire away in the&nbsp;comments!</p>
<h2 id="whats-the-plan">What&#8217;s the&nbsp;plan?</h2>
<p>There are a lot of different things to explore, especially because I&#8217;m a
complete novice in that&nbsp;field.</p>
<ul>
<li>I want to develop a Firefox plugin to extract the browsing
    information (I still need to work out exactly which kind of
    information to retrieve). The idea is to provide some <em>raw</em>
    browsing data, and then to transform it and store it in the
    best possible&nbsp;way.</li>
<li>Analyse how to store the information in a graph database. What are
    the different methods to store this data and to visualize the
    relationships between different pieces of data? How can I define the
    different contexts, and add that information to the&nbsp;db?</li>
<li>Process the data using well-known recommendation algorithms. Compare
    the results and critically assess their&nbsp;value.</li>
</ul>
<p>There is plenty of stuff I want to try during this&nbsp;experiment:</p>
<ul>
<li>I want to try using Gephi to visualize the connections between the
    links and the&nbsp;contexts</li>
<li>Try using graph databases such as&nbsp;Neo4j</li>
<li>Have a deeper look at tools such as scikit-learn (a machine
    learning toolkit in&nbsp;Python)</li>
<li>Analyse web pages in order to categorize them, and process their
    contents as well, to do some keyword-based&nbsp;classification.</li>
</ul>
<p>Lots of work on the way,&nbsp;yay!</p>
</article>

</body>

</html>