mirror of
https://github.com/almet/notmyidea.git
synced 2025-04-28 19:42:37 +02:00
348 lines
No EOL
17 KiB
HTML
348 lines
No EOL
17 KiB
HTML
<!DOCTYPE html>
|
||
<html lang="en">
|
||
<head>
|
||
<meta http-equiv="X-UA-Compatible" content="IE=edge">
|
||
<meta http-equiv="content-type" content="text/html; charset=utf-8">
|
||
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1">
|
||
<link rel="shortcut icon" type="image/x-icon" href="favicon.ico" />
|
||
|
||
<title>Analyse users' browsing context to build up a web recommender - Alexis - Carnets en ligne</title>
|
||
|
||
<meta charset="utf-8" />
|
||
<link href="https://blog.notmyidea.org/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Alexis - Carnets en ligne Full Atom Feed" />
|
||
<link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/poole.css"/>
|
||
<link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/syntax.css"/>
|
||
<link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/lanyon.css"/>
|
||
<link rel="stylesheet" href="//fonts.googleapis.com/css?family=PT+Serif:400,400italic,700%7CPT+Sans:400">
|
||
<link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/styles.css"/>
|
||
|
||
|
||
|
||
<style>
|
||
|
||
h1 {
|
||
font-family: "Avant Garde", Avantgarde, "Century Gothic", CenturyGothic, "AppleGothic", sans-serif;
|
||
padding: 80px 50px;
|
||
text-align: center;
|
||
text-transform: uppercase;
|
||
text-rendering: optimizeLegibility;
|
||
color: #202020;
|
||
letter-spacing: .1em;
|
||
text-shadow:
|
||
-1px -1px 1px #111,
|
||
2px 2px 1px #eaeaea;
|
||
}
|
||
|
||
#main {
|
||
text-align: justify;
|
||
text-justify: inter-word;
|
||
}
|
||
#main h1 {
|
||
padding: 10px;
|
||
}
|
||
|
||
.post-headline {
|
||
padding: 15px;
|
||
}
|
||
</style>
|
||
</head>
|
||
|
||
<body>
|
||
<!-- Target for toggling the sidebar `.sidebar-checkbox` is for regular
|
||
styles, `#sidebar-checkbox` for behavior. -->
|
||
<input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox">
|
||
<!-- Toggleable sidebar -->
|
||
<div class="sidebar" id="sidebar">
|
||
<div class="sidebar-item">
|
||
<div class="profile">
|
||
<img src="https://blog.notmyidea.org/theme/img/profile.png"/>
|
||
</div>
|
||
</div>
|
||
|
||
<nav class="sidebar-nav">
|
||
<a class="sidebar-nav-item" href="/">Articles</a>
|
||
|
||
<a class="sidebar-nav-item" href="https://www.vieuxsinge.com">Brasserie du Vieux Singe</a>
|
||
<a class="sidebar-nav-item" href="http://blog.notmyidea.org/pages/about.html">A propos</a>
|
||
<a class="sidebar-nav-item" href="https://twitter.com/ametaireau">Messages courts</a>
|
||
<a class="sidebar-nav-item" href="https://github.com/almet">Code</a>
|
||
</nav>
|
||
</div> <div class="wrap">
|
||
<div class="masthead">
|
||
<div class="container">
|
||
<h3 class="masthead-title">
|
||
<a href="https://blog.notmyidea.org/" title="Home">Alexis - Carnets en ligne</a>
|
||
</h3>
|
||
</div>
|
||
</div>
|
||
|
||
<div class="container content">
|
||
<div id="main" class="posts">
|
||
<h1 class="post-title">Analyse users' browsing context to build up a web recommender</h1>
|
||
<span class="post-date">01 avril 2011, dans <a class="no-color" href="category/technologie.html">Technologie</a></span>
|
||
<img id="illustration" src="" />
|
||
|
||
<div class="post article">
|
||
<div id="toc_container">
|
||
<div class="toc">
|
||
<ul>
|
||
<li><a href="#analyse-users-browsing-context-to-build-up-a-web-recommender">Analyse users' browsing context to build up a web recommender</a><ul>
|
||
<li><a href="#introduction-and-rationale">Introduction and rationale</a></li>
|
||
<li><a href="#background-review">Background Review</a><ul>
|
||
<li><a href="#data-preparation-and-extraction">Data preparation and extraction</a></li>
|
||
<li><a href="#information-filtering">Information filtering</a></li>
|
||
</ul>
|
||
</li>
|
||
<li><a href="#privacy">Privacy</a></li>
|
||
<li><a href="#whats-the-plan">What's the plan ?</a></li>
|
||
</ul>
|
||
</li>
|
||
</ul>
|
||
</div>
|
||
|
||
</div>
|
||
<h1>🌟</h1>
|
||
|
||
<p>No, this is not an april's fool ;)</p>
|
||
<p>Wow, it's been a long time. My year in Oxford is going really well. I
|
||
realized few days ago that the end of the year is approaching really
|
||
quickly. Exams are coming in one month or such and then I'll be working
|
||
full time on my dissertation topic.</p>
|
||
<p>When I learned we'll have about 6 month to work on something, I first
|
||
thought about doing a packaging related stuff, but finally decided to
|
||
start something new. After all, that's the good time to learn.</p>
|
||
<p>Since a long time, I'm being impressed by the <a href="http://last.fm">last.fm</a>
|
||
recommender system. They're <em>scrobbling</em> the music I listen to since
|
||
something like 5 years now and the recommendations they're doing are
|
||
really nice and accurate (I discovered <strong>a lot</strong> of great artists
|
||
listening to the "neighbour radio".) (by the way, <a href="http://lastfm.com/user/akounet/">here
|
||
is</a> my lastfm account)</p>
|
||
<p>So I decided to work on recommender systems, to better understand what
|
||
is it about.</p>
|
||
<p>Recommender systems are usually used to increase the sales of products
|
||
(like Amazon.com does) which is not really what I'm looking for (The one
|
||
who know me a bit know I'm kind of sick about all this consumerism going
|
||
on).</p>
|
||
<p>Actually, the most simple thing I thought of was the web: I'm browsing
|
||
it quite every day and each time new content appears. I've stopped to
|
||
follow <a href="https://bitbucket.org/bruno/aspirator/">my feed reader</a> because
|
||
of the information overload, and reduced drastically the number of
|
||
people I follow <a href="http://twitter.com/ametaireau/">on twitter</a>.</p>
|
||
<p>Too much information kills the information.</p>
|
||
<p>You shall got what will be my dissertation topic: a recommender system
|
||
for the web. Well, such recommender systems already exists, so I will
|
||
try to add contextual information to them: you're probably not
|
||
interested by the same topics at different times of the day, or
|
||
depending on the computer you're using. We can also probably make good
|
||
use of the way you browse to create groups into the content you're
|
||
browsing (or even use the great firefox4 tab group feature).</p>
|
||
<p>There is a large part of concerns to have about user's privacy as well.</p>
|
||
<p>Here is my proposal (copy/pasted from the one I had to do for my master)</p>
|
||
<h2 id="introduction-and-rationale">Introduction and rationale</h2>
|
||
<p>Nowadays, people surf the web more and more often. New web pages are
|
||
created each day so the amount of information to retrieve is more
|
||
important as the time passes. These users uses the web in different
|
||
contexts, from finding cooking recipes to technical articles.</p>
|
||
<p>A lot of people share the same interest to various topics, and the
|
||
quantity of information is such than it's really hard to triage them
|
||
efficiently without spending hours doing it. Firstly because of the huge
|
||
quantity of information but also because the triage is something
|
||
relative to each person. Although, this triage can be facilitated by
|
||
fetching the browsing information of all particular individuals and put
|
||
the in perspective.</p>
|
||
<p>Machine learning is a branch of Artificial Intelligence (AI) which deals
|
||
with how a program can learn from data. Recommendation systems are a
|
||
particular application area of machine learning which is able to
|
||
recommend things (links in our case) to the users, given a particular
|
||
database containing the previous choices users have made.</p>
|
||
<p>This browsing information is currently available in browsers. Even if it
|
||
is not in a very usable format, it is possible to transform it to
|
||
something useful. This information gold mine just wait to be used.
|
||
Although, it is not as simple as it can seems at the first approach: It
|
||
is important to take care of the context the user is in while browsing
|
||
links. For instance, It's more likely that during the day, a computer
|
||
scientist will browse computing related links, and that during the
|
||
evening, he browse cooking recipes or something else.</p>
|
||
<p>Page contents are also interesting to analyse, because that's what
|
||
people browse and what actually contain the most interesting part of the
|
||
information. The raw data extracted from the browsing can then be
|
||
translated into something more useful (namely tags, type of resource,
|
||
visit frequency, navigation context etc.)</p>
|
||
<p>The goal of this dissertation is to create a recommender system for web
|
||
links, including this context information.</p>
|
||
<p>At the end of the dissertation, different pieces of software will be
|
||
provided, from raw data collection from the browser to a recommendation
|
||
system.</p>
|
||
<h2 id="background-review">Background Review</h2>
|
||
<p>This dissertation is mainly about data extraction, analysis and
|
||
recommendation systems. Two different research area can be isolated:
|
||
Data preprocessing and Information filtering.</p>
|
||
<p>The first step in order to make recommendations is to gather some data.
|
||
The more data we have available, the better it is (T. Segaran, 2007).
|
||
This data can be retrieved in various ways, one of them is to get it
|
||
directly from user's browsers.</p>
|
||
<h3 id="data-preparation-and-extraction">Data preparation and extraction</h3>
|
||
<p>The data gathered from browsers is basically URLs and additional
|
||
information about the context of the navigation. There is clearly a need
|
||
to extract more information about the meaning of the data the user is
|
||
browsing, starting by the content of the web pages.</p>
|
||
<p>Because the information provided on the current Web is not meant to be
|
||
read by machines (T. Berners Lee, 2001) there is a need of tools to
|
||
extract meaning from web pages. The information needs to be preprocessed
|
||
before stored in a machine readable format, allowing to make
|
||
recommendations (Choochart et Al, 2004).</p>
|
||
<p>Data preparation is composed of two steps: cleaning and structuring (
|
||
Castellano et Al, 2007). Because raw data can contain a lot of un-needed
|
||
text (such as menus, headers etc.) and need to be cleaned prior to be
|
||
stored. Multiple techniques can be used here and belongs to boilerplate
|
||
removal and full text extraction (Kohlschütter et Al, 2010).</p>
|
||
<p>Then, structuring the information: category, type of content (news,
|
||
blog, wiki) can be extracted from raw data. This kind of information is
|
||
not clearly defined by HTML pages so there is a need of tools to
|
||
recognise them.</p>
|
||
<p>Some context-related information can also be inferred from each
|
||
resource. It can go from the visit frequency to the navigation group the
|
||
user was in while browsing. It is also possible to determine if the user
|
||
"liked" a resource, and determine a mark for it, which can be used by
|
||
information filtering a later step (T. Segaran, 2007).</p>
|
||
<p>At this stage, structuring the data is required. Storing this kind of
|
||
information in RDBMS can be a bit tedious and require complex queries to
|
||
get back the data in an usable format. Graph databases can play a major
|
||
role in the simplification of information storage and querying.</p>
|
||
<h3 id="information-filtering">Information filtering</h3>
|
||
<p>To filter the information, three techniques can be used (Balabanovic et
|
||
Al, 1997):</p>
|
||
<ul>
|
||
<li>The content-based approach states that if an user have liked
|
||
something in the past, he is more likely to like similar things in
|
||
the future. So it's about establishing a profile for the user and
|
||
compare new items against it.</li>
|
||
<li>The collaborative approach will rather recommend items that other
|
||
similar users have liked. This approach consider only the
|
||
relationship between users, and not the profile of the user we are
|
||
making recommendations to.</li>
|
||
<li>the hybrid approach, which appeared recently combine both of the
|
||
previous approaches, giving recommendations when items score high
|
||
regarding user's profile, or if a similar user already liked it.</li>
|
||
</ul>
|
||
<p>Grouping is also something to consider at this stage (G. Myatt, 2007).
|
||
Because we are dealing with huge amount of data, it can be useful to
|
||
detect group of data that can fit together. Data clustering is able to
|
||
find such groups (T. Segaran, 2007).</p>
|
||
<p>References:</p>
|
||
<ul>
|
||
<li>Balabanović, M., & Shoham, Y. (1997). Fab: content-based,
|
||
collaborative recommendation. Communications of the ACM, 40(3),
|
||
66–72. ACM. Retrieved March 1, 2011, from
|
||
<a href="http://portal.acm.org/citation.cfm?id=245108.245124&">http://portal.acm.org/citation.cfm?id=245108.245124&</a>;.</li>
|
||
<li>Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic
|
||
web: Scientific american. Scientific American, 284(5), 34–43.
|
||
Retrieved November 21, 2010, from
|
||
<a href="http://www.citeulike.org/group/222/article/1176986">http://www.citeulike.org/group/222/article/1176986</a>.</li>
|
||
<li>Castellano, G., Fanelli, A., & Torsello, M. (2007). LODAP: a LOg
|
||
DAta Preprocessor for mining Web browsing patterns. Proceedings of
|
||
the 6th Conference on 6th WSEAS Int. Conf. on Artificial
|
||
Intelligence, Knowledge Engineering and Data Bases-Volume 6 (p.
|
||
12–17). World Scientific and Engineering Academy and Society
|
||
(WSEAS). Retrieved March 8, 2011, from
|
||
<a href="http://portal.acm.org/citation.cfm?id=1348485.1348488">http://portal.acm.org/citation.cfm?id=1348485.1348488</a>.</li>
|
||
<li>Kohlschutter, C., Fankhauser, P., & Nejdl, W. (2010). Boilerplate
|
||
detection using shallow text features. Proceedings of the third ACM
|
||
international conference on Web search and data mining (p. 441–450).
|
||
ACM. Retrieved March 8, 2011, from
|
||
<a href="http://portal.acm.org/citation.cfm?id=1718542">http://portal.acm.org/citation.cfm?id=1718542</a>.</li>
|
||
<li>Myatt, G. J. (2007). Making Sense of Data: A Practical Guide to
|
||
Exploratory Data Analysis and Data Mining.</li>
|
||
<li>Segaran, T. (2007). Collective Intelligence.</li>
|
||
</ul>
|
||
<h2 id="privacy">Privacy</h2>
|
||
<p>The first thing that's come to people minds when it comes to process
|
||
their browsing data is privacy. People don't want to be stalked. That's
|
||
perfectly right, and I don't either.</p>
|
||
<p>But such a system don't have to deal with people identities. It's
|
||
completely possible to process completely anonymous data, and that's
|
||
probably what I'm gonna do.</p>
|
||
<p>By the way, if you have interesting thoughts about that, if you do know
|
||
projects that do seems related, fire the comments !</p>
|
||
<h2 id="whats-the-plan">What's the plan ?</h2>
|
||
<p>There is a lot of different things to explore, especially because I'm a
|
||
complete novice in that field.</p>
|
||
<ul>
|
||
<li>I want to develop a firefox plugin, to extract the browsing
|
||
informations ( still, I need to know exactly which kind of
|
||
informations to retrieve). The idea is to provide some <em>raw</em>
|
||
browsing data, and then to transform it and to store it in the
|
||
better possible way.</li>
|
||
<li>Analyse how to store the informations in a graph database. What can
|
||
be the different methods to store this data and to visualize the
|
||
relationship between different pieces of data? How can I define the
|
||
different contexts, and add those informations in the db?</li>
|
||
<li>Process the data using well known recommendation algorithms. Compare
|
||
the results and criticize their value.</li>
|
||
</ul>
|
||
<p>There is plenty of stuff I want to try during this experimentation:</p>
|
||
<ul>
|
||
<li>I want to try using Geshi to visualize the connexion between the
|
||
links, and the contexts</li>
|
||
<li>Try using graph databases such as Neo4j</li>
|
||
<li>Having a deeper look at tools such as scikit.learn (a machine
|
||
learning toolkit in python)</li>
|
||
<li>Analyse web pages in order to categorize them. Processing their
|
||
contents as well, to do some keyword based classification will be
|
||
done.</li>
|
||
</ul>
|
||
<p>Lot of work on its way, yay !</p>
|
||
</div>
|
||
</div>
|
||
</div>
|
||
|
||
<label for="sidebar-checkbox" class="sidebar-toggle"></label>
|
||
|
||
<script>
|
||
(function(document) {
|
||
var i = 0;
|
||
// snip empty header rows since markdown can't
|
||
var rows = document.querySelectorAll('tr');
|
||
for(i=0; i<rows.length; i++) {
|
||
var ths = rows[i].querySelectorAll('th');
|
||
var rowlen = rows[i].children.length;
|
||
if (ths.length > 0 && ths.length === rowlen) {
|
||
rows[i].remove();
|
||
}
|
||
}
|
||
})(document);
|
||
</script>
|
||
|
||
<script>
|
||
/* Lanyon & Poole are Copyright (c) 2014 Mark Otto. Adapted to Pelican 20141223 and extended a bit by @thomaswilley */
|
||
(function(document) {
|
||
var toggle = document.querySelector('.sidebar-toggle');
|
||
var sidebar = document.querySelector('#sidebar');
|
||
var checkbox = document.querySelector('#sidebar-checkbox');
|
||
document.addEventListener('click', function(e) {
|
||
var target = e.target;
|
||
if(!checkbox.checked ||
|
||
sidebar.contains(target) ||
|
||
(target === checkbox || target === toggle)) return;
|
||
checkbox.checked = false;
|
||
}, false);
|
||
})(document);
|
||
</script>
|
||
<!-- Piwik -->
|
||
<script type="text/javascript">
|
||
var _paq = _paq || [];
|
||
_paq.push(['trackPageView']);
|
||
_paq.push(['enableLinkTracking']);
|
||
(function() {
|
||
var u="//tracker.notmyidea.org/";
|
||
_paq.push(['setTrackerUrl', u+'piwik.php']);
|
||
_paq.push(['setSiteId', 3]);
|
||
var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
|
||
g.type='text/javascript'; g.async=true; g.defer=true; g.src=u+'piwik.js'; s.parentNode.insertBefore(g,s);
|
||
})();
|
||
</script>
|
||
<noscript><p><img src="//tracker.notmyidea.org/piwik.php?idsite=3" style="border:0;" alt="" /></p></noscript>
|
||
<!-- End Piwik Code -->
|
||
</div>
|
||
</body>
|
||
</html> |