<!DOCTYPE html>
|
|
<html lang="en">
|
|
|
|
<head>
|
|
<title>PyPI on CouchDB - Alexis Métaireau</title>
|
|
<meta charset="utf-8" />
|
|
<meta name="viewport" content="width=device-width, initial-scale=1">
|
|
<link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/main.css?v2" type="text/css" />
|
|
<link href="https://blog.notmyidea.org/feeds/all.atom.xml" type="application/atom+xml" rel="alternate"
|
|
title="Alexis Métaireau ATOM Feed" />
|
|
</head>
|
|
|
|
<body>
|
|
<div id="content">
|
|
<section id="links">
|
|
<ul>
|
|
<li><a class="main" href="/">Alexis Métaireau</a></li>
|
|
<li><a class=""
|
|
href="https://blog.notmyidea.org/journal/index.html">Journal</a></li>
|
|
<li><a class="selected"
|
|
href="https://blog.notmyidea.org/code/">Code, etc.</a></li>
|
|
<li><a class=""
|
|
href="https://blog.notmyidea.org/weeknotes/">Notes hebdo</a></li>
|
|
<li><a class=""
|
|
href="https://blog.notmyidea.org/lectures/">Lectures</a></li>
|
|
</ul>
|
|
</section>
|
|
|
|
<header>
|
|
<h1 class="post-title">PyPI on CouchDB</h1>
|
|
<time datetime="2011-01-20T00:00:00+01:00">20 January 2011</time>
|
|
|
|
|
|
</header>
|
|
<article>
|
|
|
|
<p>For now, there are two ways to retrieve data from PyPI (the Python
Package Index): you can rely either on <span class="caps">XML</span>/<span class="caps">RPC</span> or on the “simple” <span class="caps">API</span>. The
simple <span class="caps">API</span> is not as simple to use as its name suggests, and it has several
drawbacks.</p>
|
|
<p>Basically, if you want to use information coming from the simple <span class="caps">API</span>,
you have to parse web pages manually and extract the information using
some black voodoo magic. Sadly, magic has a price, and it’s sometimes
impossible to get exactly the information you want from this
index. That’s the technique currently used by distutils2,
setuptools and pip.</p>
|
|
<p>On the other hand, while <span class="caps">XML</span>/<span class="caps">RPC</span> works fine, it requires extra
work from the Python servers each time you request something, which can
lead to outages from time to time. It’s also important to point
out that, even though PyPI has a mirroring infrastructure, it only covers
the so-called <em>simple</em> <span class="caps">API</span>, not <span class="caps">XML</span>/<span class="caps">RPC</span>.</p>
|
|
<h2 id="couchdb">CouchDB</h2>
|
|
<p>Here comes CouchDB. CouchDB is a document-oriented database that knows
how to speak <span class="caps">REST</span> and <span class="caps">JSON</span>. It’s easy to use, and provides a replication
mechanism out of the box.</p>
|
|
<h2 id="so-what">So, what?</h2>
|
|
<p>Hmm, I’m sure you got it. I’ve written a piece of software to link
information from PyPI to a CouchDB instance. You can then replicate the
whole PyPI index with a single <span class="caps">HTTP</span> request to the CouchDB server. You can
also access the information from the index directly through a <span class="caps">REST</span> <span class="caps">API</span>,
speaking <span class="caps">JSON</span>. Handy.</p>
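<p>As a rough sketch of what “a single <span class="caps">HTTP</span> request” can look like: CouchDB exposes a
<code>_replicate</code> endpoint that accepts a <span class="caps">JSON</span> body naming a source and a target
database. The snippet below only builds that request and does not send it. The local
server address, the target database name and the helper name
<code>build_replication_request</code> are assumptions made for the example; the source is
the public node whose address is given later in this post.</p>

```python
import json
import urllib.request


def build_replication_request(local_server, source_db, target_db="pypi"):
    """Build (but do not send) a POST to CouchDB's _replicate endpoint."""
    body = json.dumps({
        "source": source_db,      # remote database to pull from
        "target": target_db,      # local database name (assumed here)
        "create_target": True,    # create the local database if it is missing
    }).encode("utf-8")
    return urllib.request.Request(
        local_server.rstrip("/") + "/_replicate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local server; adjust to your own CouchDB instance.
request = build_replication_request(
    "http://localhost:5984",
    "http://couchdb.notmyidea.org/pypi",
)
# To actually start the replication: urllib.request.urlopen(request)
```

<p>Keeping the request construction separate from the network call makes it easy to
inspect what would be sent before pointing it at a real server.</p>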
|
|
<p>PyPIonCouch uses the PyPI <span class="caps">XML</span>/<span class="caps">RPC</span> <span class="caps">API</span> to get data from PyPI, and
generates records in the CouchDB instance.</p>
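<p>To give an idea of the shape of that import, here is a minimal sketch; it is not the
actual PyPIonCouch code. It assumes a client object that behaves like
<code>xmlrpc.client.ServerProxy</code> pointed at PyPI, and uses the
<code>package_releases</code> and <code>release_data</code> <span class="caps">XML</span>/<span class="caps">RPC</span> methods. The
function names and the <code>name-version</code> document id scheme are hypothetical.</p>

```python
def build_document(name, version, metadata):
    """Turn one release's metadata into a CouchDB-style document (a dict)."""
    document = dict(metadata)               # copy, so the input is left untouched
    document["_id"] = name + "-" + version  # hypothetical id scheme
    document["name"] = name
    document["version"] = version
    return document


def fetch_release_documents(client, name):
    """Yield one document per release of a project.

    client is expected to behave like an xmlrpc.client.ServerProxy
    connected to the PyPI XML/RPC endpoint (an assumption of this sketch).
    """
    for version in client.package_releases(name):
        metadata = client.release_data(name, version)
        yield build_document(name, version, metadata)
```

<p>Each yielded dictionary could then be stored in CouchDB as one <span class="caps">JSON</span> document.</p>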
|
|
<p>The final goal is to stop relying on this “simple” <span class="caps">API</span>, and to rely on a
<span class="caps">REST</span> interface instead. I have set up a CouchDB server on my own machine,
which is available at
<a href="http://couchdb.notmyidea.org/_utils/database.html?pypi">http://couchdb.notmyidea.org/_utils/database.html?pypi</a>.</p>
|
|
<p>There is not a lot to see there for now, but I did the first import
from PyPI yesterday and all went fine: it’s possible to access the
metadata of all PyPI projects via a <span class="caps">REST</span> interface. The next step is to
write a client for this <span class="caps">REST</span> interface in distutils2.</p>
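<p>Such a client does not need much: CouchDB serves each document as <span class="caps">JSON</span> at
<code>/database/document-id</code>. The sketch below shows the idea using only the standard
library. The function names are hypothetical, and how document ids map to projects in
the PyPI database is an assumption left to the caller.</p>

```python
import json
import urllib.parse
import urllib.request


def document_url(database_url, doc_id):
    """Return the URL of one document in a CouchDB database."""
    # Percent-encode the id so that "/" or spaces stay inside one path segment.
    return database_url.rstrip("/") + "/" + urllib.parse.quote(doc_id, safe="")


def get_document(database_url, doc_id):
    """Fetch a document over HTTP and decode its JSON body."""
    with urllib.request.urlopen(document_url(database_url, doc_id)) as response:
        return json.loads(response.read().decode("utf-8"))
```

<p>Calling <code>get_document</code> performs a real network request, so it is kept separate
from the <span class="caps">URL</span> construction, which can be checked offline.</p>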
|
|
<h2 id="example">Example</h2>
|
|
<p>For now, you can use pypioncouch via the command line, or via the Python
<span class="caps">API</span>.</p>
|
|
<h3 id="using-the-command-line">Using the command line</h3>
|
|
<p>You can run something like this for a full import. This <strong>will</strong> take
a long time, because it fetches all the projects on PyPI and imports their metadata:</p>
|
|
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>pypioncouch<span class="w"> </span>--fullimport<span class="w"> </span>http://your.couchdb.instance/
</code></pre></div>
|
|
|
|
<p>If you already have the data on your CouchDB instance, you can just
update it with the latest information from PyPI. <strong>However, I recommend
simply replicating the principal node, hosted at
<a href="http://couchdb.notmyidea.org/pypi/">http://couchdb.notmyidea.org/pypi/</a></strong>, to avoid duplicating nodes:</p>
|
|
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>pypioncouch<span class="w"> </span>--update<span class="w"> </span>http://your.couchdb.instance/
</code></pre></div>
|
|
|
|
<p>The principal node is updated once a day for now. I’ll see whether that’s
enough, and adjust over time.</p>
|
|
<h3 id="using-the-python-api">Using the Python <span class="caps">API</span></h3>
<p>You can also use the Python <span class="caps">API</span> to interact with pypioncouch:</p>
|
|
<div class="highlight"><pre><span></span><code><span class="o">&gt;&gt;&gt;</span> <span class="kn">from</span> <span class="nn">pypioncouch</span> <span class="kn">import</span> <span class="n">XmlRpcImporter</span><span class="p">,</span> <span class="n">import_all</span><span class="p">,</span> <span class="n">update</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">import_all</span><span class="p">()</span>  <span class="c1"># full import of every project's metadata</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">update</span><span class="p">()</span>  <span class="c1"># only fetch what changed since the last import</span>
</code></pre></div>
|
|
|
|
<h2 id="whats-next">What’s next?</h2>
|
|
<p>I want to build a couchapp to make PyPI easy to browse. Here are
some of the features I’d like to offer:</p>
|
|
<ul>
|
|
<li>List all the available projects</li>
|
|
<li>List all the projects, filtered by specifiers</li>
|
|
<li>List all the projects by author/maintainer</li>
|
|
<li>List all the projects by keywords</li>
|
|
<li>A page for each project</li>
<li>Provide an equivalent of the PyPI “Simple” <span class="caps">API</span>: even though I want to replace
it, I think it will make mirrors really easy to set up,
thanks to CouchDB’s out-of-the-box replication</li>
|
|
</ul>
|
|
<p>I also still need to polish the import mechanism, so that I can store
the following directly in CouchDB:</p>
|
|
<ul>
|
|
<li>The <span class="caps">OPML</span> files for each project</li>
|
|
<li>The upload_time in a CouchDB-friendly format (a list of ints)</li>
<li>The tags as lists (currently they are a single space-separated string)</li>
|
|
</ul>
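<p>The last two conversions in the list above could be sketched as follows. This is only an
illustration: the function names are hypothetical, and it assumes the upload time has
already been turned into a standard <code>datetime</code> object and that the tags arrive as
a single string.</p>

```python
from datetime import datetime


def upload_time_as_list(upload_time):
    """Convert a datetime into [year, month, day, hour, minute, second]."""
    return list(upload_time.timetuple()[:6])


def keywords_as_list(keywords):
    """Split a space-separated keyword string into a list of tags."""
    # A missing value (None or "") simply gives an empty list.
    return (keywords or "").split()


print(upload_time_as_list(datetime(2011, 1, 20, 13, 37, 0)))
print(keywords_as_list("couchdb  pypi mirror"))
```

<p>A list of integers works well as a CouchDB view key, since array keys sort element by
element and can be grouped by year, month or day.</p>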
|
|
<p>The work I’ve done so far is available at
<a href="https://bitbucket.org/ametaireau/pypioncouch/">https://bitbucket.org/ametaireau/pypioncouch/</a>. Keep in mind that it’s
still a work in progress, and everything can break at any time. However,
any feedback will be appreciated!</p>
|
|
</article>
|
|
<footer>
|
|
<a id="feed" href="/feeds/all.atom.xml"><img src="/theme/rss.svg" /></a>
|
|
</footer>
|
|
</div>
|
|
</body>
|
|
|
|
</html>