mirror of
https://github.com/almet/notmyidea.git
synced 2025-04-28 19:42:37 +02:00
115 lines
No EOL
5.4 KiB
HTML
<!DOCTYPE HTML>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<link rel="stylesheet" href="./theme/css/main.css" type="text/css" media="screen" charset="utf-8">
<link href="./feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Alexis' log ATOM Feed" />
<title>Alexis Métaireau</title>
</head>
<body>
<div id="top">
<p class="author"><a href="./about.html">Alexis Métaireau</a>'s thoughts</p>
<ul class="links">
<li><a href=".">↵ </a></li>
</ul>
</div>
<div class="content clear">
<h1>PyPI on CouchDB</h1>
<p class="date">Published on Thu 20 January 2011</p>
<p>There are currently two ways to retrieve data from PyPI (the Python Package
Index): you can rely either on XML/RPC or on the "simple" API. The simple
API is not as simple to use as its name suggests, and it has several
drawbacks.</p>
<p>Basically, if you want to use information coming from the simple API, you
have to parse web pages manually and extract the information using some black
voodoo magic. Unfortunately, magic has a price, and it's sometimes impossible to
get exactly the information you want from this index. That's the technique
currently used by distutils2, setuptools and pip.</p>
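<p>To give an idea of the kind of guesswork involved, here is a minimal sketch
of one such heuristic: guessing a project name and version from an archive file
name of the kind linked from a simple index page. The function name, the list of
extensions and the splitting rule are assumptions made for this illustration;
this is not the code used by distutils2, setuptools or pip.</p>

```python
import re

# Archive extensions recognised by this sketch (an assumption, not a complete list).
EXTENSIONS = (".tar.gz", ".tar.bz2", ".tgz", ".zip")


def guess_name_and_version(filename):
    """Guess (name, version) from an archive file name such as 'foo-1.0.tar.gz'.

    Heuristic: the version starts after the first dash that is followed by a
    digit. Returns None when the file name does not look like an archive.
    """
    for extension in EXTENSIONS:
        if filename.endswith(extension):
            stem = filename[:-len(extension)]
            break
    else:
        return None
    match = re.match(r"^(.+?)-(\d.*)$", stem)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(guess_name_and_version("foo-1.0.tar.gz"))              # ('foo', '1.0')
# The heuristic breaks as soon as a project name contains a dash followed by a digit:
print(guess_name_and_version("python-2-helper-0.3.tar.gz"))  # ('python', '2-helper-0.3')
```

<p>The second call shows why this approach is fragile: the file name alone does
not say where the name stops and where the version starts.</p>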
<p>On the other hand, while XML/RPC works fine, it requires extra work
from the Python servers each time you request something, which can lead to
outages from time to time. It's also important to point out that, even though
PyPI has a mirroring infrastructure, it only covers the so-called <em>simple</em> API,
not XML/RPC.</p>
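<p>As a rough illustration, every XML/RPC query is a method call serialised to
XML and sent to the server, which has to decode and process it. The sketch below
only builds and decodes such a payload locally, without any network access. It
uses the Python 3 module name <code>xmlrpc.client</code>; the helper name
<code>build_call</code> and the project name "foo" are placeholders.</p>

```python
import xmlrpc.client


def build_call(method, *params):
    """Serialise an XML/RPC method call into the XML body that would be POSTed."""
    return xmlrpc.client.dumps(params, methodname=method)


# "package_releases" is one of the methods exposed by the PyPI XML/RPC API;
# "foo" is a placeholder project name.
body = build_call("package_releases", "foo")

# Decoding the body gives back the parameters and the method name,
# which is the work the server has to do for every single request.
params, method = xmlrpc.client.loads(body)
print(method, params)
```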
<div class="section" id="couchdb">
<h2>CouchDB</h2>
<p>Here comes CouchDB. CouchDB is a document-oriented database that
knows how to speak REST and JSON. It's easy to use and provides a replication
mechanism out of the box.</p>
</div>
<div class="section" id="so-what">
<h2>So, what?</h2>
<p>Hmm, I'm sure you got it. I've written a piece of software to link information
from PyPI to a CouchDB instance. You can then replicate the whole PyPI index with
a single HTTP request to the CouchDB server. You can also access the information
from the index directly through a REST API that speaks JSON. Handy.</p>
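<p>As a sketch of what that single request could look like, CouchDB exposes a
<code>_replicate</code> endpoint that accepts a JSON body naming a source and a
target. The code below only builds the request and does not send it. The local
server address (CouchDB's default port on localhost), the target database name
and the helper name are assumptions made for this example.</p>

```python
import json
import urllib.request


def build_replication_request(local_server, source_db, target_db):
    """Build (without sending) the POST request asking CouchDB to replicate."""
    payload = {
        "source": source_db,     # remote database to pull from
        "target": target_db,     # local database to write to
        "create_target": True,   # create the local database if it is missing
    }
    return urllib.request.Request(
        local_server.rstrip("/") + "/_replicate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_replication_request(
    "http://localhost:5984/",              # assumed local CouchDB server
    "http://couchdb.notmyidea.org/pypi/",  # the node mentioned in this post
    "pypi",                                # assumed local database name
)
# urllib.request.urlopen(request) would actually start the replication.
print(request.get_method(), request.full_url)
```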
<p>So PyPIonCouch uses the PyPI XML/RPC API to get data from PyPI and
generates records in the CouchDB instance.</p>
<p>The final goal is to avoid relying on this "simple" API and to rely on a REST
interface instead. I have set up a CouchDB server on my server, which is
available at <a class="reference external" href="http://couchdb.notmyidea.org/_utils/database.html?pypi">http://couchdb.notmyidea.org/_utils/database.html?pypi</a>.</p>
<p>There is not a lot to
see there for now, but I did the first import from PyPI yesterday and all
went fine: it's possible to access the metadata of all PyPI projects via a REST
interface. The next step is to write a client for this REST interface in
distutils2.</p>
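<p>A client for this REST interface could start out as small as the sketch
below, which relies on CouchDB's standard <code>_all_docs</code> endpoint to list
document identifiers. The sample response is hand-written for the example and the
identifiers in it are hypothetical; how PyPIonCouch actually names its documents
is not described here, and this is not the distutils2 client.</p>

```python
import json
from urllib.parse import urlencode

DATABASE_URL = "http://couchdb.notmyidea.org/pypi/"


def all_docs_url(database_url, limit=10):
    """Return the URL listing the first `limit` documents of a database."""
    return database_url.rstrip("/") + "/_all_docs?" + urlencode({"limit": limit})


def document_ids(response_body):
    """Extract the document identifiers from an `_all_docs` JSON response."""
    return [row["id"] for row in json.loads(response_body)["rows"]]


# Hand-written response in the shape CouchDB returns; the ids are hypothetical.
SAMPLE_RESPONSE = json.dumps({
    "total_rows": 2,
    "offset": 0,
    "rows": [
        {"id": "bar", "key": "bar", "value": {"rev": "1-abc"}},
        {"id": "foo", "key": "foo", "value": {"rev": "1-def"}},
    ],
})

print(all_docs_url(DATABASE_URL, limit=2))
print(document_ids(SAMPLE_RESPONSE))
```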
</div>
<div class="section" id="example">
<h2>Example</h2>
<p>For now, you can use pypioncouch via the command line or via the Python API.</p>
<div class="section" id="using-the-command-line">
<h3>Using the command line</h3>
<p>You can run something like this for a full import. This <strong>will</strong> take a long
time, because it fetches all the projects on PyPI and imports their metadata:</p>
<pre class="literal-block">
$ pypioncouch --fullimport http://your.couchdb.instance/
</pre>
<p>If you already have the data on your CouchDB instance, you can just update it
with the latest information from PyPI. <strong>However, I recommend simply replicating
the principal node, hosted at http://couchdb.notmyidea.org/pypi/</strong>, to avoid
the duplication of nodes:</p>
<pre class="literal-block">
$ pypioncouch --update http://your.couchdb.instance/
</pre>
<p>The principal node is currently updated once a day. I'll see whether that's
enough and adjust over time.</p>
</div>
<div class="section" id="using-the-python-api">
<h3>Using the Python API</h3>
<p>You can also use the Python API to interact with pypioncouch:</p>
<pre class="literal-block">
>>> from pypioncouch import XmlRpcImporter, import_all, update
>>> import_all()
>>> update()
</pre>
</div>
</div>
<div class="section" id="what-s-next">
<h2>What's next?</h2>
<p>I want to make a couchapp to navigate PyPI easily. Here are some of
the features I want to offer:</p>
<ul class="simple">
<li>List all the available projects</li>
<li>List all the projects, filtered by specifiers</li>
<li>List all the projects by author/maintainer</li>
<li>List all the projects by keywords</li>
<li>A page for each project</li>
<li>Provide a PyPI "Simple" API equivalent: even though I want to replace it, I do
think it will make it really easy to set up mirrors, thanks to the out-of-the-box
CouchDB replication</li>
</ul>
<p>I also still need to polish the import mechanism, so that I can store the
following directly in CouchDB:</p>
<ul class="simple">
<li>The OPML files for each project</li>
<li>The upload_time in a CouchDB-friendly format (a list of integers)</li>
<li>The tags as lists (currently it's only a string separated by spaces)</li>
</ul>
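<p>The last two conversions could look like the sketch below. The function names
and the record layout are assumptions made for the illustration, not the current
pypioncouch code. A list of integers is convenient because CouchDB can sort and
group such keys by year, month and so on in its views.</p>

```python
from datetime import datetime


def upload_time_as_list(upload_time):
    """Turn a datetime into [year, month, day, hour, minute, second]."""
    return [
        upload_time.year, upload_time.month, upload_time.day,
        upload_time.hour, upload_time.minute, upload_time.second,
    ]


def keywords_as_list(keywords):
    """Split a space-separated keywords string into a list of tags."""
    return keywords.split()


def normalise(record):
    """Return a copy of `record` with CouchDB-friendly field values."""
    cleaned = dict(record)
    cleaned["upload_time"] = upload_time_as_list(record["upload_time"])
    cleaned["keywords"] = keywords_as_list(record["keywords"])
    return cleaned


# Hypothetical record, shaped only for this example.
record = {
    "name": "foo",
    "upload_time": datetime(2011, 1, 20, 9, 30, 0),
    "keywords": "couchdb pypi  mirror",
}
print(normalise(record))
```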
<p>The work I've done so far is available at
<a class="reference external" href="https://bitbucket.org/ametaireau/pypioncouch/">https://bitbucket.org/ametaireau/pypioncouch/</a>. Keep in mind that it's still
a work in progress, and everything can break at any time. However, any feedback
is appreciated!</p>
</div>
</div>
</body>
</html>