<!DOCTYPE HTML>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<link rel="stylesheet" href="./theme/css/main.css" type="text/css" media="screen" charset="utf-8">
<link href="./feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Alexis' log ATOM Feed" />
<title>Alexis Métaireau</title>
</head>
<body>
<div id="top">
<p class="author"><a href="./about.html">Alexis Métaireau</a>'s thoughs</p>
<ul class="links">
<li><a href="."></a></li>
</ul>
</div>
<div class="content clear">
<h1>Introducing the distutils2 index crawlers</h1>
<p class="date">Published on Tue 06 July 2010</p>
<p>I've been working on distutils2 for about a month now, even though
I've been a bit busy (I had some courses and exams to work on).</p>
<p>I'll try to sum up my general feelings here, and the work I've done
so far. If you're interested, you can also find my weekly
summaries in
<a class="reference external" href="http://wiki.notmyidea.org/distutils2_schedule">a dedicated wiki page</a>.</p>
<div class="section" id="general-feelings">
<h2>General feelings</h2>
<p>First, and it's a really important point, the GSoC is going very
well, for me as well as for the other students, at least from my
perspective. It's a pleasure to work with such enthusiastic people,
and it makes the overall atmosphere very pleasant.</p>
<p>I've spent time reading the existing codebase, understanding
what we're going to do, and the rationale behind it.</p>
<p>It's really clear to me now: what we're building is the
foundation of a packaging infrastructure in Python. The fact is
that many projects co-exist, each coming with its own good
concepts. Distutils2 tries to take the interesting parts of them
all, and to provide them in the Python standard library, following
the recently written PEPs about packaging.</p>
<p>With distutils2, it will be simpler to make &quot;things&quot; compatible. So
if you think of a new way to deal with distributions and packaging
in Python, you can use the Distutils2 APIs to do so.</p>
</div>
<div class="section" id="tasks">
<h2>Tasks</h2>
<p>My main task while working on distutils2 is to provide an
installation and an uninstallation command, as described in PEP
376. For this, I first need to get information about the existing
distributions (their name, version, metadata, dependencies,
etc.).</p>
<p>The main index, which you probably know and use, is PyPI. You can
access it at <a class="reference external" href="http://pypi.python.org">http://pypi.python.org</a>.</p>
</div>
<div class="section" id="pypi-index-crawling">
<h2>PyPI index crawling</h2>
<p>There are two ways to get this information from PyPI: using the
simple API, or via XML-RPC calls.</p>
<p>One goal was to use the version specifiers defined in
<a class="reference external" href="http://www.python.org/dev/peps/pep-0345/">PEP 345</a>,
and to provide a way to sort the retrieved distributions depending
on our needs, to pick the version we want/need.</p>
<div class="section" id="using-the-simple-api">
<h3>Using the simple API</h3>
<p>The simple API is composed of HTML pages you can access at
<a class="reference external" href="http://pypi.python.org/simple/">http://pypi.python.org/simple/</a>.</p>
<p>Distribute and Setuptools already provide a crawler for that, but
it is tied to their internal mechanisms, and I found the code less
clear than I wanted. That's why I preferred to pick up the good
ideas, and some implementation details, while rethinking the
global architecture.</p>
<p>The rules are simple: each project has a dedicated page, which
allows us to get information about:</p>
<ul class="simple">
<li>the distribution download locations (for some versions)</li>
<li>homepage links</li>
<li>some other useful information, such as the bugtracker
address.</li>
</ul>
<p>If you want to find all the distributions of the &quot;EggsAndSpam&quot;
project, you could do the following (don't pay too much attention
to the names here, as the API will probably change a bit):</p>
<div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="n">index</span> <span class="o">=</span> <span class="n">SimpleIndex</span><span class="p">()</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">index</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&quot;EggsAndSpam&quot;</span><span class="p">)</span>
<span class="p">[</span><span class="n">EggsAndSpam</span> <span class="mf">1.1</span><span class="p">,</span> <span class="n">EggsAndSpam</span> <span class="mf">1.2</span><span class="p">,</span> <span class="n">EggsAndSpam</span> <span class="mf">1.3</span><span class="p">]</span>
</pre></div>
<p>We could also use version specifiers:</p>
<div class="highlight"><pre><span class="o">&gt;&gt;&gt;</span> <span class="n">index</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">&quot;EggsAndSpam (&lt; =1.2)&quot;</span><span class="p">)</span>
<span class="p">[</span><span class="n">EggsAndSpam</span> <span class="mf">1.1</span><span class="p">,</span> <span class="n">EggsAndSpam</span> <span class="mf">1.2</span><span class="p">]</span>
</pre></div>
<p>Internally, what's done here is the following (a rough sketch in
code follows the list):</p>
<ul class="simple">
<li>it processes the
<a class="reference external" href="http://pypi.python.org/simple/FooBar/">http://pypi.python.org/simple/FooBar/</a>
page, searching for download URLs;</li>
<li>for each distribution download URL found, it creates an object
containing information about the project name, the version, and
the URL where the archive lives;</li>
<li>it sorts the found distributions, using version numbers. The
default behavior here is to prefer source distributions (over
binary ones), and to rely on the latest &quot;final&quot; release (rather
than alphas, betas, etc.).</li>
</ul>
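<p>To make that concrete, here is a minimal sketch of how such a
crawler could be written. This is not the actual distutils2
implementation: the names (<tt>Distribution</tt>, <tt>find_urls</tt>,
<tt>sort_distributions</tt>) are made up for the example.</p>
<div class="highlight"><pre># A rough sketch of the crawling logic described above; all names
# here are hypothetical, not the real distutils2 API.
import re
import urllib2

INDEX_URL = 'http://pypi.python.org/simple/'
HREF_RE = re.compile(r'href=&quot;([^&quot;]+)&quot;')

class Distribution(object):
    # one (name, version, download url) triple found on the index
    def __init__(self, name, version, url):
        self.name, self.version, self.url = name, version, url

def find_urls(project_name):
    # fetch the project's simple-index page, return the linked URLs
    page = urllib2.urlopen(INDEX_URL + project_name + '/').read()
    return HREF_RE.findall(page)

def sort_distributions(dists):
    # naive string sort on versions; the real code relies on the
    # version comparison rules from PEP 386 instead
    return sorted(dists, key=lambda d: d.version, reverse=True)
</pre></div>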
<p>So, nothing hard or difficult here.</p>
<p>We provide a bunch of other features, such as relying on the new
PyPI mirroring infrastructure, or filtering the found
distributions by various criteria. If you're curious, please
browse the
<a class="reference external" href="http://distutils2.notmyidea.org/">distutils2 documentation</a>.</p>
</div>
<div class="section" id="using-xml-rpc">
<h3>Using xml-rpc</h3>
<p>We can also make XML-RPC calls to retrieve information from
PyPI. It's a much more reliable way to get information from the
index (as the index itself provides the information), but it costs
processing on the remote PyPI server.</p>
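<p>As a quick illustration, the standard library alone is enough to
query PyPI's XML-RPC interface by hand; the method names below
belong to PyPI's documented XML-RPC API, not to distutils2:</p>
<div class="highlight"><pre>import xmlrpclib

client = xmlrpclib.ServerProxy('http://pypi.python.org/pypi')
# all released versions of a project
print client.package_releases('EggsAndSpam')
# download URLs, then metadata, for a given release
print client.release_urls('EggsAndSpam', '1.3')
print client.release_data('EggsAndSpam', '1.3')
</pre></div>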
<p>For now, this way of querying the index is not available in
Distutils2, as I'm still working on it. The main pieces are
already there (I'll reuse some of the work I've done on the
SimpleIndex querying, and
<a class="reference external" href="http://github.com/ametaireau/pypiclient">some code already set up</a>);
what I need to do is to provide an XML-RPC PyPI mock server, and
that's what I'm currently working on.</p>
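<p>Such a mock doesn't need much code, as the standard library
already provides the building blocks. A minimal sketch (not the
actual test server) could look like this:</p>
<div class="highlight"><pre># a tiny XML-RPC server faking PyPI's answers, so the client can
# be tested without hitting the real index
from SimpleXMLRPCServer import SimpleXMLRPCServer

def package_releases(project_name, show_hidden=False):
    # canned answer, whatever project is asked for
    return ['1.1', '1.2', '1.3']

server = SimpleXMLRPCServer(('localhost', 8000))
server.register_function(package_releases)
server.serve_forever()
</pre></div>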
</div>
</div>
<div class="section" id="processes">
<h2>Processes</h2>
<p>For now, I'm trying to follow the &quot;documentation, then tests,
then code&quot; path, and that really seems necessary when working
with a community. Code is hard to read and understand compared to
documentation, and documentation is easier to change.</p>
<p>I should have done this while writing the simple index crawling
code; it would have avoided some changes to the API, and some lost
time.</p>
<p>Also, I've set up
<a class="reference external" href="http://wiki.notmyidea.org/distutils2_schedule">a schedule</a>, and
the goal is to make sure everything will be ready in time for the
end of the summer. (And now, I need to learn to follow schedules
...)</p>
</div>
</div>
</body>
</html>