blog.notmyidea.org/content/Technologie/2011-04-01-Link-suggestions.md

# Analyse users' browsing context to build up a web recommender


No, this is not an April Fools' joke ;)

Wow, it's been a long time. My year in Oxford is going really well. I
realized a few days ago that the end of the year is approaching really
quickly. Exams are coming in a month or so and then I'll be working
full time on my dissertation topic.

When I learned we'd have about 6 months to work on something, I first
thought about doing something packaging-related, but finally decided to
start something new. After all, this is a good time to learn.

For a long time, I've been impressed by the [last.fm](http://last.fm)
recommender system. They've been *scrobbling* the music I listen to for
something like 5 years now and the recommendations they make are
really nice and accurate (I discovered **a lot** of great artists
listening to the "neighbour radio"). By the way, [here
is](http://lastfm.com/user/akounet/) my lastfm account.

So I decided to work on recommender systems, to better understand what
they are about.

Recommender systems are usually used to increase the sales of products
(like Amazon.com does), which is not really what I'm looking for (those
who know me a bit know I'm kind of sick of all this consumerism going
on).

Actually, the simplest thing I thought of was the web: I browse
it pretty much every day and new content appears each time. I've stopped
following [my feed reader](https://bitbucket.org/bruno/aspirator/) because
of the information overload, and drastically reduced the number of
people I follow [on twitter](http://twitter.com/ametaireau/).

Too much information kills the information.

You've probably guessed what my dissertation topic will be: a recommender
system for the web. Well, such recommender systems already exist, so I will
try to add contextual information to them: you're probably not
interested in the same topics at different times of the day, or
depending on the computer you're using. We can also probably make good
use of the way you browse to group the content you're
browsing (or even use the great Firefox 4 tab group feature).

There are also serious concerns about users' privacy to take into account.

Here is my proposal (copy/pasted from the one I had to write for my master's):

## Introduction and rationale

Nowadays, people surf the web more and more often. New web pages are
created each day, so the amount of information to retrieve keeps
growing over time. These users use the web in different
contexts, from finding cooking recipes to reading technical articles.

A lot of people share an interest in the same topics, and the
quantity of information is such that it's really hard to triage it
efficiently without spending hours doing so: firstly because of the huge
quantity of information, but also because the triage is
specific to each person. However, this triage can be made easier by
collecting the browsing information of individuals and putting
it in perspective.

Machine learning is a branch of Artificial Intelligence (AI) which deals
with how a program can learn from data. Recommendation systems are a
particular application area of machine learning: they
recommend things (links in our case) to users, given a
database containing the previous choices users have made.

This browsing information is currently available in browsers. Even if it
is not in a very usable format, it is possible to transform it into
something useful. This information gold mine is just waiting to be used.
However, it is not as simple as it may seem at first: it
is important to take into account the context the user is in while browsing
links. For instance, it's more likely that during the day a computer
scientist will browse computing-related links, and that during the
evening they will browse cooking recipes or something else.
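As a rough illustration of how raw history could be turned into something usable, here is a minimal sketch that reads visits from a Firefox `places.sqlite` history database. It assumes the `moz_places` and `moz_historyvisits` tables and a `visit_date` stored in microseconds since the Unix epoch; the function name `read_visits` is made up for this example, and the schema may differ between Firefox versions.

```python
import sqlite3
from datetime import datetime, timezone

# Assumed schema: moz_historyvisits.place_id references moz_places.id.
QUERY = """
    SELECT p.url, p.title, v.visit_date
    FROM moz_historyvisits AS v
    JOIN moz_places AS p ON p.id = v.place_id
    ORDER BY v.visit_date
"""


def read_visits(connection):
    """Yield (url, title, datetime) for every visit found in the database."""
    for url, title, visit_date in connection.execute(QUERY):
        # visit_date is assumed to be microseconds since the Unix epoch.
        moment = datetime.fromtimestamp(visit_date / 1_000_000, tz=timezone.utc)
        yield url, title, moment
```

It would be called with something like `read_visits(sqlite3.connect("places.sqlite"))`, ideally on a copy of the file since the browser may keep the original locked.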

Page contents are also interesting to analyse, because that's what
people browse and what actually contains the most interesting part of the
information. The raw data extracted from the browsing can then be
translated into something more useful (namely tags, type of resource,
visit frequency, navigation context, etc.).

The goal of this dissertation is to create a recommender system for web
links, including this context information.

At the end of the dissertation, different pieces of software will be
provided, from raw data collection from the browser to a recommendation
system.

## Background Review

This dissertation is mainly about data extraction, analysis and
recommendation systems. Two different research areas can be identified:
data preprocessing and information filtering.

The first step in making recommendations is to gather some data.
The more data we have available, the better it is (T. Segaran, 2007).
This data can be retrieved in various ways, one of which is to get it
directly from users' browsers.

### Data preparation and extraction

The data gathered from browsers is basically URLs and additional
information about the context of the navigation. There is clearly a need
to extract more information about the meaning of the data the user is
browsing, starting with the content of the web pages.

Because the information provided on the current Web is not meant to be
read by machines (T. Berners-Lee, 2001), tools are needed to
extract meaning from web pages. The information needs to be preprocessed
before being stored in a machine-readable format that makes
recommendations possible (Choochart et al., 2004).

Data preparation is composed of two steps: cleaning and structuring
(Castellano et al., 2007). Raw data can contain a lot of unneeded
text (such as menus, headers, etc.) and needs to be cleaned before being
stored. Several techniques can be used here; they fall under boilerplate
removal and full-text extraction (Kohlschütter et al., 2010).
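To make the cleaning step concrete, here is a minimal sketch using only Python's standard `html.parser`. It is a crude tag-based heuristic, not the shallow-text-feature approach of Kohlschütter et al.; the list of tags treated as boilerplate and the name `extract_main_text` are assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags whose content is treated as boilerplate (an assumption for this sketch).
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the text of a page with menus, headers and scripts dropped."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Real pages often build their menus out of plain `div` elements, so a tag list like this one would only be a starting point.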

Then comes structuring the information: the category and type of content
(news, blog, wiki) can be extracted from raw data. This kind of information is
not clearly defined by HTML pages, so tools are needed to
recognise it.

Some context-related information can also be inferred for each
resource. It can range from the visit frequency to the navigation group the
user was in while browsing. It is also possible to determine whether the user
"liked" a resource and to assign it a rating, which can be used by
the information filtering step later on (T. Segaran, 2007).
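Here is a small sketch of how such context information could be derived from a list of visits. The time-slot boundaries, the 1 to 5 rating scale and all function names are arbitrary choices made for illustration; they do not come from the cited literature.

```python
from collections import Counter
from datetime import datetime


def time_slot(moment):
    """Map a timestamp to a coarse context label (boundaries are arbitrary)."""
    if 8 <= moment.hour < 18:
        return "day"
    if 18 <= moment.hour < 23:
        return "evening"
    return "night"


def summarize_visits(visits):
    """Return {url: {"count": n, "slots": Counter}} from (url, datetime) pairs."""
    summary = {}
    for url, moment in visits:
        entry = summary.setdefault(url, {"count": 0, "slots": Counter()})
        entry["count"] += 1
        entry["slots"][time_slot(moment)] += 1
    return summary


def implicit_rating(count, max_count, scale=5):
    """Scale a visit count to a 1..scale rating, relative to the most visited URL."""
    return 1 + round((scale - 1) * count / max_count)
```

Visit frequency is only one possible signal of a "like"; time spent on a page or bookmarking could feed the same kind of rating.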

At this stage, structuring the data is required. Storing this kind of
information in an RDBMS can be a bit tedious and require complex queries to
get the data back in a usable format. Graph databases can play a major
role in simplifying information storage and querying.
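To give an idea of why a graph model fits this data, here is a tiny in-memory sketch. It is not the API of any real graph database; the class `BrowsingGraph`, the relation names (`VISITED`, `IN_CONTEXT`, `TAGGED`) and the node identifiers are all invented for the example.

```python
from collections import defaultdict


class BrowsingGraph:
    """A toy property graph: nodes with properties, labelled directed edges."""

    def __init__(self):
        self.nodes = {}                 # node id -> properties
        self.edges = defaultdict(list)  # source id -> [(relation, target id)]

    def add_node(self, node_id, **properties):
        self.nodes.setdefault(node_id, {}).update(properties)

    def add_edge(self, source, relation, target):
        self.edges[source].append((relation, target))

    def neighbours(self, source, relation):
        """Targets reachable from `source` through edges labelled `relation`."""
        return [t for r, t in self.edges.get(source, []) if r == relation]


def tags_visited(graph, user, context):
    """Tags of the pages `user` visited that are linked to `context`."""
    tags = set()
    for page in graph.neighbours(user, "VISITED"):
        if context in graph.neighbours(page, "IN_CONTEXT"):
            tags.update(graph.neighbours(page, "TAGGED"))
    return tags
```

A question like "which tags does this user read about in the evening?" becomes a short walk along edges, where a relational schema would need several joins.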

### Information filtering

To filter the information, three techniques can be used (Balabanovic et
al., 1997):

  - The content-based approach states that if a user has liked
    something in the past, they are more likely to like similar things in
    the future. So it's about establishing a profile for the user and
    comparing new items against it.
  - The collaborative approach instead recommends items that other
    similar users have liked. This approach considers only the
    relationship between users, and not the profile of the user we are
    making recommendations to.
  - The hybrid approach, which appeared more recently, combines both of the
    previous approaches, giving recommendations when items score high
    against the user's profile, or when a similar user already liked them.
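The three approaches above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: profiles and ratings are dictionaries, similarity is cosine similarity, and the thresholds in `hybrid_recommend` are arbitrary. None of the function names come from the cited papers.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def content_score(user_profile, item_tags):
    """Content-based: compare an item's tag weights with the user's tag profile."""
    return cosine(user_profile, item_tags)


def collaborative_scores(ratings, user):
    """Collaborative: average other users' ratings, weighted by similarity."""
    totals, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine(ratings[user], other_ratings)
        if similarity <= 0:
            continue
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue  # only suggest links the user has not rated yet
            totals[item] = totals.get(item, 0.0) + similarity * rating
            weights[item] = weights.get(item, 0.0) + similarity
    return {item: totals[item] / weights[item] for item in totals}


def hybrid_recommend(content, collaborative, content_min=0.5, rating_min=4.0):
    """Hybrid: recommend when either signal is strong (thresholds are arbitrary)."""
    return content >= content_min or collaborative >= rating_min
```

Here `ratings` maps each user to a `{link: rating}` dictionary, and `collaborative_scores(ratings, "alice")` returns predicted ratings for the links alice has not rated yet.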

Grouping is also something to consider at this stage (G. Myatt, 2007).
Because we are dealing with huge amounts of data, it can be useful to
detect groups of data that fit together. Data clustering is able to
find such groups (T. Segaran, 2007).
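As an example of what clustering does, here is a bare-bones k-means sketch in pure Python. K-means is only one of several clustering techniques; seeding the centroids with the first `k` points and running a fixed number of iterations are simplifications chosen to keep the example short and deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of the same dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, iterations=10):
    """Group points into k clusters; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: squared_distance(point, centroids[i]))
            clusters[nearest].append(point)
        # Keep the old centroid when a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return clusters
```

For web pages, each point could be a vector of tag weights, so that pages about similar topics end up in the same cluster.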

References:

  - Balabanović, M., & Shoham, Y. (1997). Fab: content-based,
    collaborative recommendation. Communications of the ACM, 40(3),
    66–72. ACM. Retrieved March 1, 2011, from
    <http://portal.acm.org/citation.cfm?id=245108.245124&>.
  - Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic
    web. Scientific American, 284(5), 34–43.
    Retrieved November 21, 2010, from
    <http://www.citeulike.org/group/222/article/1176986>.
  - Castellano, G., Fanelli, A., & Torsello, M. (2007). LODAP: a LOg
    DAta Preprocessor for mining Web browsing patterns. Proceedings of
    the 6th Conference on 6th WSEAS Int. Conf. on Artificial
    Intelligence, Knowledge Engineering and Data Bases-Volume 6 (p.
    12–17). World Scientific and Engineering Academy and Society
    (WSEAS). Retrieved March 8, 2011, from
    <http://portal.acm.org/citation.cfm?id=1348485.1348488>.
  - Kohlschütter, C., Fankhauser, P., & Nejdl, W. (2010). Boilerplate
    detection using shallow text features. Proceedings of the third ACM
    international conference on Web search and data mining (p. 441–450).
    ACM. Retrieved March 8, 2011, from
    <http://portal.acm.org/citation.cfm?id=1718542>.
  - Myatt, G. J. (2007). Making Sense of Data: A Practical Guide to
    Exploratory Data Analysis and Data Mining.
  - Segaran, T. (2007). Programming Collective Intelligence.

## Privacy

The first thing that comes to people's minds when it comes to processing
their browsing data is privacy. People don't want to be stalked. That's
perfectly fair, and I don't either.

But such a system doesn't have to deal with people's identities. It's
entirely possible to process completely anonymous data, and that's
probably what I'm gonna do.

By the way, if you have interesting thoughts about that, or if you know of
projects that seem related, fire away in the comments!

## What's the plan?

There are a lot of different things to explore, especially because I'm a
complete novice in this field.

  - I want to develop a Firefox plugin to extract the browsing
    information (I still need to work out exactly which kind of
    information to retrieve). The idea is to provide some *raw*
    browsing data, and then to transform it and store it in the
    best possible way.
  - Analyse how to store the information in a graph database. What are
    the different ways to store this data and to visualize the
    relationships between different pieces of data? How can I define the
    different contexts, and add that information to the db?
  - Process the data using well-known recommendation algorithms. Compare
    the results and assess their value.

There is plenty of stuff I want to try during this experimentation:

  - I want to try using Gephi to visualize the connections between the
    links and the contexts
  - Try using graph databases such as Neo4j
  - Have a deeper look at tools such as scikit.learn (a machine
    learning toolkit in Python)
  - Analyse web pages in order to categorize them, processing their
    contents as well to do some keyword-based classification.

Lots of work on the way, yay!