<!DOCTYPE html>
<html lang="en">
  <head>
      <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1">
    <link rel="shortcut icon" type="image/x-icon" href="favicon.ico" />

    <title>Introducing the distutils2 index crawlers - Alexis - Carnets en ligne</title>

    <meta charset="utf-8" />
    <link href="https://blog.notmyidea.org/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Alexis - Carnets en ligne Full Atom Feed" />
    <link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/poole.css"/>
    <link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/syntax.css"/>
    <link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/lanyon.css"/>
    <link rel="stylesheet" href="//fonts.googleapis.com/css?family=PT+Serif:400,400italic,700%7CPT+Sans:400">
    <link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/styles.css"/>


<style>

h1 {
    font-family: "Avant Garde", Avantgarde, "Century Gothic", CenturyGothic, "AppleGothic", sans-serif;
    padding: 80px 50px;
    text-align: center;
    text-transform: uppercase;
    text-rendering: optimizeLegibility;
    color: #202020;
    letter-spacing: .1em;
    text-shadow:
        -1px -1px 1px #111,
        2px 2px 1px #eaeaea;
}

#main {
    text-align: justify;
    text-justify: inter-word;
}
#main h1 {
    padding: 10px;
}

.post-headline {
    padding: 15px;
    text-align: center;
}
</style>
  </head>

  <body>
<!-- Target for toggling the sidebar `.sidebar-checkbox` is for regular
styles, `#sidebar-checkbox` for behavior. -->
<input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox">
<!-- Toggleable sidebar -->
<div class="sidebar" id="sidebar">
  <div class="sidebar-item">
    <div class="profile">
      <img src="https://blog.notmyidea.org/theme/img/profile.png"/>
    </div>
  </div>

  <nav class="sidebar-nav">
  <a class="sidebar-nav-item" href="/">Articles</a>

  <a class="sidebar-nav-item" href="https://www.vieuxsinge.com">Brasserie du Vieux Singe</a>
  <a class="sidebar-nav-item" href="http://blog.notmyidea.org/pages/about.html">About</a>
  <a class="sidebar-nav-item" href="https://twitter.com/ametaireau">Short messages</a>
  <a class="sidebar-nav-item" href="https://github.com/almet">Code</a>
  </nav>
</div>    <div class="wrap">
      <div class="masthead">
        <div class="container">
          <h3 class="masthead-title">
            <a href="https://blog.notmyidea.org/" title="Home">Alexis - Carnets en ligne</a>
          </h3>
        </div>
      </div>

      <div class="container content">
<div id="main" class="posts">
<h1 class="post-title">Introducing the distutils2 index crawlers</h1>

<span class="post-date">
    06 July 2010, in <a class="no-color" href="category/technologie.html">Technology</a>
</span>
<img id="illustration" class="illustration-Technologie" src="" />

<div class="post article">
    <div id="toc_container">
      <div class="toc">
<ul>
<li><a href="#introducing-the-distutils2-index-crawlers">Introducing the distutils2 index crawlers</a><ul>
<li><a href="#general-feelings">General feelings</a></li>
<li><a href="#tasks">Tasks</a></li>
<li><a href="#pypi-index-crawling">PyPI index crawling</a><ul>
<li><a href="#using-the-simple-api">Using the simple API</a></li>
<li><a href="#using-xml-rpc">Using xml-rpc</a></li>
</ul>
</li>
<li><a href="#processes">Processes</a></li>
</ul>
</li>
</ul>
</div>

      </div>
    <h1>🌟</h1>

<p>I've been working on distutils2 for about a month now, even though I was
a bit busy (I had classes and exams to deal with).</p>
<p>I'll try to sum up my general impressions here, along with the work I've
done so far. If you're interested, you can also find my weekly summaries in <a href="http://wiki.notmyidea.org/distutils2_schedule">a
dedicated wiki page</a>.</p>
<h2 id="general-feelings">General feelings</h2>
<p>First, and it's a really important point, the GSoC is going very well,
for me as for the other students, at least from my perspective. It's a
pleasure to work with such enthusiastic people, and it makes the overall
atmosphere very pleasant.</p>
<p>To begin with, I spent time reading the existing codebase, to
understand what we're going to do and the rationale behind it.</p>
<p>It's really clear to me now: what we're building is the foundation of
a packaging infrastructure for Python. The fact is that many projects
coexist, and each comes with its own good concepts. Distutils2 tries to
take the interesting parts of all of them and to provide them in the Python
standard library, respecting the recently written PEPs about packaging.</p>
<p>With distutils2, it will be simpler to make "things" compatible. So if
you come up with a new way to deal with distributions and packaging in
Python, you can use the Distutils2 APIs to build it.</p>
<h2 id="tasks">Tasks</h2>
<p>My main task while working on distutils2 is to provide installation
and uninstallation commands, as described in PEP 376. For this, I
first need to get information about the existing distributions (their
version, name, metadata, dependencies, etc.).</p>
<p>The main index, which you probably know and use, is PyPI. You can access it at
<a href="http://pypi.python.org">http://pypi.python.org</a>.</p>
<h2 id="pypi-index-crawling">PyPI index crawling</h2>
<p>There are two ways to get this information from PyPI: using the simple
API, or via xml-rpc calls.</p>
<p>One goal was to use the version specifiers defined
in <a href="http://www.python.org/dev/peps/pep-0345/">PEP 345</a> and to provide a
way to sort the retrieved distributions according to our needs, so that we
can pick the version we want or need.</p>
<h3 id="using-the-simple-api">Using the simple API</h3>
<p>The simple API is composed of HTML pages you can access at
<a href="http://pypi.python.org/simple/">http://pypi.python.org/simple/</a>.</p>
<p>Distribute and Setuptools already provide a crawler for that, but it
is tied to their internal mechanisms, and I found the code was not
as clear as I would like. That's why I preferred to pick up the good ideas
and some implementation details, while rethinking the overall
architecture.</p>
<p>The rules are simple: each project has a dedicated page, which allows
us to get information about:</p>
<ul>
<li>the distribution download locations (for some versions)</li>
<li>homepage links</li>
<li>some other useful information, such as the bug tracker
    address.</li>
</ul>
<p>If you want to find all the distributions of the "EggsAndSpam" project,
you could do the following (do not pay too much attention to the names here,
as the API will probably change a bit):</p>
<div class="highlight"><pre><span></span>&gt;&gt;&gt; index = SimpleIndex()
&gt;&gt;&gt; index.find(&quot;EggsAndSpam&quot;)
[EggsAndSpam 1.1, EggsAndSpam 1.2, EggsAndSpam 1.3]
</pre></div>
<p>We could also use version specifiers:</p>
<div class="highlight"><pre><span></span>&gt;&gt;&gt; index.find(&quot;EggsAndSpam (&lt;=1.2)&quot;)
[EggsAndSpam 1.1, EggsAndSpam 1.2]
</pre></div>
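<p>To give an idea of what such a version specifier involves, here is a
minimal sketch that splits a requirement string into a project name and a
predicate, then filters a list of candidate versions. The helper names
(<code>parse_requirement</code>, <code>matches</code>, <code>filter_versions</code>) are
hypothetical and are not the Distutils2 API. Versions are compared as plain
tuples of integers, which is an assumption: it ignores the alpha, beta and
release-candidate markers that real version handling has to deal with.</p>

```python
import operator
import re

# Comparison operators understood by this sketch.
_OPERATORS = {
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
}

# "Name" optionally followed by "(<operator> <version>)". In the operator
# alternation, two-character operators come first so "<=" wins over "<".
_REQUIREMENT = re.compile(
    r"^\s*(?P<name>[A-Za-z0-9_.-]+)\s*"
    r"(?:\(\s*(?P<op><=|>=|==|!=|<|>)\s*"
    r"(?P<version>[0-9]+(?:\.[0-9]+)*)\s*\))?\s*$"
)


def version_key(version):
    """Turn "1.2.0" into (1, 2): numeric parts, trailing zeros dropped."""
    parts = [int(part) for part in version.split(".")]
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def parse_requirement(requirement):
    """Split "Name (<=1.2)" into (name, operator string, version)."""
    match = _REQUIREMENT.match(requirement)
    if match is None:
        raise ValueError("cannot parse requirement: %r" % requirement)
    return match.group("name"), match.group("op"), match.group("version")


def matches(candidate, op, wanted):
    """True if the candidate version satisfies the predicate, or if there is none."""
    if op is None:
        return True
    return _OPERATORS[op](version_key(candidate), version_key(wanted))


def filter_versions(requirement, available):
    """Keep the versions allowed by the requirement, sorted oldest first."""
    _, op, wanted = parse_requirement(requirement)
    kept = [version for version in available if matches(version, op, wanted)]
    return sorted(kept, key=version_key)
```

<p>Sorting on the numeric key rather than on the raw string matters: as
strings, "1.10" would sort before "1.2".</p>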


<p>Internally, what's done here is the following:</p>
<ul>
<li>it processes the <a href="http://pypi.python.org/simple/FooBar/">http://pypi.python.org/simple/FooBar/</a> page,
    searching for download URLs.</li>
<li>for each distribution download URL found, it creates an object
    containing information about the project name, the version and the
    URL where the archive is located.</li>
<li>it sorts the distributions found, using version numbers. The default
    behavior here is to prefer source distributions (over binary ones),
    and to rely on the latest "final" distribution (rather than beta,
    alpha, etc. ones).</li>
</ul>
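<p>The three steps above can be sketched with the standard library alone.
This is only an illustration, not the Distutils2 code: <code>LinkCollector</code>,
<code>Release</code>, <code>find_releases</code> and <code>sort_releases</code> are hypothetical
names, the page is passed in as a string instead of being fetched from PyPI,
and the filename pattern is a simplifying assumption that only recognises
archives named like <code>Name-1.2.tar.gz</code>. Anything else, including alpha
and beta releases, is simply skipped.</p>

```python
import re
from collections import namedtuple
from html.parser import HTMLParser

# One archive found on the page.
Release = namedtuple("Release", "name version url is_source")

# Simplified, assumed naming scheme: "<name>-<numeric version>[-pyX.Y]<ext>".
_ARCHIVE = re.compile(
    r"^(?P<name>[A-Za-z0-9_.]+)-(?P<version>\d+(?:\.\d+)*)"
    r"(?:-py\d+\.\d+)?(?P<ext>\.tar\.gz|\.zip|\.egg)$"
)
_SOURCE_EXTENSIONS = (".tar.gz", ".zip")


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def find_releases(page_html):
    """Steps 1 and 2: scan the page, build one Release per archive link."""
    collector = LinkCollector()
    collector.feed(page_html)
    releases = []
    for url in collector.links:
        # Drop a trailing "#md5=..." fragment, keep the last path segment.
        filename = url.split("#")[0].rsplit("/", 1)[-1]
        match = _ARCHIVE.match(filename)
        if match:
            releases.append(Release(
                name=match.group("name"),
                version=match.group("version"),
                url=url,
                is_source=match.group("ext") in _SOURCE_EXTENSIONS,
            ))
    return releases


def sort_releases(releases):
    """Step 3: highest version first; at equal version, source archives first."""
    def key(release):
        numbers = tuple(int(part) for part in release.version.split("."))
        return (numbers, release.is_source)
    return sorted(releases, key=key, reverse=True)
```

<p>With this ordering, the first element of the sorted list is the preferred
candidate: the most recent version, as a source distribution when one
exists.</p>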
<p>So, nothing difficult here.</p>
<p>We provide a bunch of other features, like relying on the new PyPI
mirroring infrastructure or filtering the distributions found by some
criteria. If you're curious, please browse the <a href="http://distutils2.notmyidea.org/">distutils2
documentation</a>.</p>
<h3 id="using-xml-rpc">Using xml-rpc</h3>
<p>We can also make some xml-rpc calls to retrieve information from PyPI.
It's a much more reliable way to get information from the index
(as the index itself provides the information), but it costs
processing time on the remote PyPI server.</p>
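<p>As an illustration, here is a minimal sketch using the standard library
XML-RPC client. It relies on the <code>package_releases</code> and
<code>release_urls</code> methods of the PyPI XML-RPC interface; the endpoint URL
is an assumption to double-check, and <code>source_urls</code> and
<code>make_client</code> are hypothetical helper names, not part of Distutils2.</p>

```python
import xmlrpc.client

# Assumed endpoint of the index's XML-RPC interface.
PYPI_XMLRPC_URL = "http://pypi.python.org/pypi"


def make_client(url=PYPI_XMLRPC_URL):
    """Build a proxy; nothing is sent until a method is called on it."""
    return xmlrpc.client.ServerProxy(url)


def source_urls(client, project):
    """Return (version, url) pairs for the source archives of each release.

    `client` is any object exposing the `package_releases` and
    `release_urls` methods, such as an xmlrpc.client.ServerProxy.
    """
    found = []
    for version in client.package_releases(project):
        # One extra remote call per version: this is the server-side cost.
        for info in client.release_urls(project, version):
            if info.get("packagetype") == "sdist":
                found.append((version, info["url"]))
    return found
```

<p>Taking the client as a parameter keeps the function testable without
network access: any object with the same two methods can stand in for the
real proxy.</p>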
<p>For now, querying through xml-rpc is not available in
Distutils2, as I'm still working on it. The main pieces are already present
(I'll reuse some work I did for the SimpleIndex querying, and <a href="http://github.com/ametaireau/pypiclient">some
code already set up</a>); what I
need to do is to provide an xml-rpc PyPI mock server, and that's what
I'm currently working on.</p>
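<p>To show what such a mock could look like, here is a rough sketch built on
the standard library <code>SimpleXMLRPCServer</code>, serving canned data on a
local port from a background thread. It is not the mock server being written
for Distutils2: the function names and the fake data are made up for the
example, and only one method of the index is imitated.</p>

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

# Canned data served by the mock; the project and versions are made up.
FAKE_RELEASES = {"EggsAndSpam": ["1.1", "1.2", "1.3"]}


def package_releases(project, show_hidden=False):
    """Imitate the index: list the known versions of a project."""
    return FAKE_RELEASES.get(project, [])


def start_mock_server():
    """Serve the fake index on a free local port, in a background thread."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(package_releases)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server, thread


def stop_mock_server(server, thread):
    """Stop serving, release the port and wait for the thread to end."""
    server.shutdown()
    server.server_close()
    thread.join()


def fetch_releases_from_mock(project):
    """Start the mock, ask it for a project's releases, then stop it."""
    server, thread = start_mock_server()
    try:
        port = server.server_address[1]  # port picked by the OS
        client = xmlrpc.client.ServerProxy("http://127.0.0.1:%d/" % port)
        return client.package_releases(project)
    finally:
        stop_mock_server(server, thread)
```

<p>Binding to port 0 lets the operating system pick a free port, so several
test runs can work side by side without clashing.</p>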
<h2 id="processes">Processes</h2>
<p>For now, I'm trying to follow the "documentation, then test, then code"
path, and that seems to be really needed when working with a community.
Code is harder to read and understand than documentation, and documentation
is easier to change.</p>
<p>While working on the simple index crawler, I should have done this: it
would have avoided some API changes and some lost time.</p>
<p>Also, I've set up <a href="http://wiki.notmyidea.org/distutils2_schedule">a
schedule</a>, and the goal
is to be sure everything will be ready in time, by the end of the
summer. (And now, I need to learn to follow schedules ...)</p>
  </div>
</div>
      </div>

      <label for="sidebar-checkbox" class="sidebar-toggle"></label>

      <script>
        (function(document) {
          var i = 0;
          // snip empty header rows since markdown can't
          var rows = document.querySelectorAll('tr');
          for(i=0; i<rows.length; i++) {
            var ths = rows[i].querySelectorAll('th');
            var rowlen = rows[i].children.length;
            if (ths.length > 0 && ths.length === rowlen) {
              rows[i].remove();
            }
          }
        })(document);
      </script>

      <script>
        /* Lanyon & Poole are Copyright (c) 2014 Mark Otto. Adapted to Pelican 20141223 and extended a bit by @thomaswilley */
        (function(document) {
          var toggle = document.querySelector('.sidebar-toggle');
          var sidebar = document.querySelector('#sidebar');
          var checkbox = document.querySelector('#sidebar-checkbox');
          document.addEventListener('click', function(e) {
            var target = e.target;
            if(!checkbox.checked ||
            sidebar.contains(target) ||
            (target === checkbox || target === toggle)) return;
            checkbox.checked = false;
            }, false);
            })(document);
      </script>
      <!-- Piwik -->
      <script type="text/javascript">
        var _paq = _paq || [];
        _paq.push(['trackPageView']);
        _paq.push(['enableLinkTracking']);
        (function() {
          var u="//tracker.notmyidea.org/";
          _paq.push(['setTrackerUrl', u+'piwik.php']);
          _paq.push(['setSiteId', 3]);
          var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
          g.type='text/javascript'; g.async=true; g.defer=true; g.src=u+'piwik.js'; s.parentNode.insertBefore(g,s);
        })();
      </script>
      <noscript><p><img src="//tracker.notmyidea.org/piwik.php?idsite=3" style="border:0;" alt="" /></p></noscript>
      <!-- End Piwik Code -->
     </div>
  </body>
</html>