<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1">
<link rel="shortcut icon" type="image/x-icon" href="favicon.ico" />
<title>PyPI on CouchDB - Alexis - Carnets en ligne</title>
<meta charset="utf-8" />
<link href="https://blog.notmyidea.org/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Alexis - Carnets en ligne Full Atom Feed" />
<link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/poole.css"/>
<link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/syntax.css"/>
<link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/lanyon.css"/>
<link rel="stylesheet" href="//fonts.googleapis.com/css?family=PT+Serif:400,400italic,700%7CPT+Sans:400">
<link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/styles.css"/>
<style>
h1 {
font-family: "Avant Garde", Avantgarde, "Century Gothic", CenturyGothic, "AppleGothic", sans-serif;
padding: 80px 50px;
text-align: center;
text-transform: uppercase;
text-rendering: optimizeLegibility;
color: #202020;
letter-spacing: .1em;
text-shadow:
-1px -1px 1px #111,
2px 2px 1px #eaeaea;
}
#main {
text-align: justify;
text-justify: inter-word;
}
#main h1 {
padding: 10px;
}
.post-headline {
padding: 15px;
}
</style>
</head>
<body>
<!-- Target for toggling the sidebar `.sidebar-checkbox` is for regular
styles, `#sidebar-checkbox` for behavior. -->
<input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox">
<!-- Toggleable sidebar -->
<div class="sidebar" id="sidebar">
<div class="sidebar-item">
<div class="profile">
<img src="https://blog.notmyidea.org/theme/img/profile.png"/>
</div>
</div>
<nav class="sidebar-nav">
<a class="sidebar-nav-item" href="/">Articles</a>
<a class="sidebar-nav-item" href="https://www.vieuxsinge.com">Brasserie du Vieux Singe</a>
<a class="sidebar-nav-item" href="http://blog.notmyidea.org/pages/about.html">About</a>
<a class="sidebar-nav-item" href="https://twitter.com/ametaireau">Short messages</a>
<a class="sidebar-nav-item" href="https://github.com/almet">Code</a>
</nav>
</div> <div class="wrap">
<div class="masthead">
<div class="container">
<h3 class="masthead-title">
<a href="https://blog.notmyidea.org/" title="Home">Alexis - Carnets en ligne</a>
</h3>
</div>
</div>
<div class="container content">
<div id="main" class="posts">
<h1 class="post-title">PyPI on CouchDB</h1>
<span class="post-date">20 January 2011, in <a class="no-color" href="category/technologie.html">Technology</a></span>
<img id="illustration" src="" />
<div class="post article">
<div id="toc_container">
<div class="toc">
<ul>
<li><a href="#pypi-on-couchdb">PyPI on CouchDB</a><ul>
<li><a href="#couchdb">CouchDB</a></li>
<li><a href="#so-what">So, what?</a></li>
<li><a href="#example">Example</a><ul>
<li><a href="#using-the-command-line">Using the command line</a></li>
<li><a href="#using-the-python-api">Using the Python API</a></li>
</ul>
</li>
<li><a href="#whats-next">What's next?</a></li>
</ul>
</li>
</ul>
</div>
</div>
<h1>🌟</h1>
<p>Currently, there are two ways to retrieve data from PyPI (the Python
Package Index): the XML-RPC API and the "simple" API. The simple API is
not as simple to use as its name suggests, and it has several
drawbacks.</p>
<p>Basically, if you want to use information coming from the simple API,
you have to parse web pages yourself and extract what you need with
some black voodoo magic. Sadly, magic has a price, and it is sometimes
impossible to get exactly the information you want from this
index. That's the technique currently used by distutils2,
setuptools and pip.</p>
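<p>To give an idea of the scraping involved, here is a minimal sketch using
only the Python standard library. It is not the code used by distutils2,
setuptools or pip; the names LinkCollector and extract_links are made up
for this illustration.</p>

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every anchor tag found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(page):
    """Return all link targets of an HTML page given as a string."""
    collector = LinkCollector()
    collector.feed(page)
    return collector.links
```

<p>Note that the result is only a flat list of URLs: telling an archive from
a home page, or finding the version of a release, still has to be guessed
from the link text or the file name.</p>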
<p>On the other hand, while XML-RPC works fine, it requires extra
work from the Python servers each time you request something, which can
lead to outages from time to time. It's also important to point
out that, even though PyPI has a mirroring infrastructure, it only covers
the so-called <em>simple</em> API, not the XML-RPC one.</p>
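<p>For comparison, a query through XML-RPC could look like the sketch below.
It relies on the package_releases and release_data methods of the PyPI
XML-RPC API; the endpoint URL is an assumption (it is today's address, not
necessarily the one in use when this was written), and the helper names
are made up for this illustration.</p>

```python
import xmlrpc.client

# Assumption: the present-day XML-RPC endpoint of PyPI.
PYPI_XMLRPC_URL = "https://pypi.org/pypi"


def make_client(url=PYPI_XMLRPC_URL):
    """Create an XML-RPC proxy; no request is sent until a method is called."""
    return xmlrpc.client.ServerProxy(url)


def fetch_latest_metadata(client, name):
    """Return the metadata of the first release listed for a project.

    Returns None when the index lists no release for that name.
    """
    versions = client.package_releases(name)
    if not versions:
        return None
    return client.release_data(name, versions[0])
```

<p>Each call to fetch_latest_metadata costs two round trips to the PyPI
servers, which is exactly the kind of load described above.</p>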
<h2 id="couchdb">CouchDB</h2>
<p>Here comes CouchDB. CouchDB is a document-oriented database that speaks
REST and JSON. It's easy to use, and it provides a replication mechanism
out of the box.</p>
<h2 id="so-what">So, what?</h2>
<p>Hmm, I'm sure you got it. I've written a piece of software that copies
information from PyPI into a CouchDB instance. You can then replicate the
whole PyPI index with a single HTTP request to the CouchDB server. You can
also access the information from the index directly through a REST API
that speaks JSON. Handy.</p>
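<p>That single HTTP request is a POST to the _replicate endpoint of the
CouchDB server that should receive the data. The sketch below only builds
the request, so it can be inspected before being sent; the function name
and the local server address used in the note are assumptions.</p>

```python
import json
import urllib.request


def build_replication_request(local_server, source_db, target_db):
    """Prepare (without sending) a CouchDB replication request."""
    body = json.dumps({"source": source_db, "target": target_db})
    return urllib.request.Request(
        local_server.rstrip("/") + "/_replicate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

<p>Calling it with "http://localhost:5984/" as the local server, the remote
database URL as the source and "pypi" as the target, then passing the
result to urllib.request.urlopen, would start the copy. The target database
has to exist already, unless "create_target" is also set to true in the
JSON body.</p>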
<p>So PyPIonCouch uses the PyPI XML-RPC API to get data from PyPI and
generates records in the CouchDB instance.</p>
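<p>The idea can be sketched as follows. This is not the actual PyPIonCouch
code: the function names are made up, and keying each document by the
project name is a hypothetical choice made for the example.</p>

```python
import json
import urllib.parse
import urllib.request


def to_document(metadata):
    """Copy release metadata into a dict usable as a CouchDB document."""
    document = dict(metadata)
    # Hypothetical id scheme: one document per project, keyed by its name.
    document["_id"] = metadata["name"]
    return document


def build_put_request(database_url, document):
    """Prepare (without sending) the PUT request that stores the document."""
    doc_id = urllib.parse.quote(document["_id"], safe="")
    return urllib.request.Request(
        database_url.rstrip("/") + "/" + doc_id,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

<p>Updating a document that already exists would also require sending its
current _rev value, which this sketch leaves out.</p>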
<p>The final goal is to stop relying on this "simple" API and to rely on a
REST interface instead. I have set up a CouchDB instance on my server,
which is available at
<a href="http://couchdb.notmyidea.org/_utils/database.html?pypi">http://couchdb.notmyidea.org/_utils/database.html?pypi</a>.</p>
<p>There is not a lot to see there for now, but I did the first import
from PyPI yesterday and all went fine: it's possible to access the
metadata of all PyPI projects via a REST interface. The next step is to
write a client for this REST interface in distutils2.</p>
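<p>Such a client does not need much. The sketch below is only a guess at its
shape, not distutils2 code: it assumes that each project is stored in a
document whose id is the project name, and the opener parameter exists
only so that the function can be tried without a network.</p>

```python
import json
import urllib.parse
import urllib.request


def get_project(database_url, name, opener=urllib.request.urlopen):
    """Fetch one document from a CouchDB database and decode its JSON body."""
    # Assumption: the document id is the project name.
    url = database_url.rstrip("/") + "/" + urllib.parse.quote(name, safe="")
    with opener(url) as response:
        return json.loads(response.read().decode("utf-8"))
```

<p>No HTML parsing is involved: the response body is already a JSON
document.</p>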
<h2 id="example">Example</h2>
<p>For now, you can use pypioncouch via the command line, or via the Python
API.</p>
<h3 id="using-the-command-line">Using the command line</h3>
<p>You can do something like this for a full import. This <strong>will</strong> take
a long time, because it fetches every project on PyPI and imports its
metadata:</p>
<div class="highlight"><pre><span></span><span class="err">$</span> <span class="n">pypioncouch</span> <span class="o">--</span><span class="n">fullimport</span> <span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="n">your</span><span class="o">.</span><span class="n">couchdb</span><span class="o">.</span><span class="n">instance</span><span class="o">/</span>
</pre></div>
<p>If you already have the data on your CouchDB instance, you can just
update it with the latest information from PyPI. <strong>However, I recommend
simply replicating the principal node, hosted at
<a href="http://couchdb.notmyidea.org/pypi/">http://couchdb.notmyidea.org/pypi/</a></strong>, to avoid duplicating the
import work across nodes:</p>
<div class="highlight"><pre><span></span>$ pypioncouch --update http://your.couchdb.instance/
</pre></div>
<p>For now, the principal node is updated once a day. I'll see whether that's
enough, and adjust over time.</p>
<h3 id="using-the-python-api">Using the Python API</h3>
<p>You can also use the Python API to interact with pypioncouch:</p>
<div class="highlight"><pre><span></span><span class="o">&gt;&gt;&gt;</span> <span class="kn">from</span> <span class="nn">pypioncouch</span> <span class="kn">import</span> <span class="n">XmlRpcImporter</span><span class="p">,</span> <span class="n">import_all</span><span class="p">,</span> <span class="n">update</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">import_all</span><span class="p">()</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">update</span><span class="p">()</span>
</pre></div>
<h2 id="whats-next">What's next?</h2>
<p>I want to make a couchapp, in order to navigate PyPI easily. Here are
some of the features I want to offer:</p>
<ul>
<li>List all the available projects</li>
<li>List all the projects, filtered by specifiers</li>
<li>List all the projects by author/maintainer</li>
<li>List all the projects by keywords</li>
<li>A page for each project</li>
<li>Provide a PyPI "Simple" API equivalent. Even though I want to replace
it, I do think it would make mirrors really easy to set up,
thanks to CouchDB's out-of-the-box replication</li>
</ul>
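<p>Listings like these map naturally onto CouchDB views, which live in a
design document. The sketch below builds one as a Python dict; the design
document name, the view names, and the author and keywords fields are
assumptions about how the imported documents look, not a description of
the existing database. The map functions themselves are JavaScript, as
CouchDB expects by default.</p>

```python
import json


def build_design_document():
    """Return a design document with two illustrative views."""
    by_author = (
        "function(doc) {"
        " if (doc.author) { emit(doc.author, null); }"
        " }"
    )
    by_keyword = (
        "function(doc) {"
        " if (doc.keywords) {"
        " doc.keywords.split(' ').forEach(function(word) {"
        " if (word) { emit(word, null); }"
        " });"
        " }"
        " }"
    )
    return {
        "_id": "_design/pypi",  # hypothetical design document name
        "language": "javascript",
        "views": {
            "by_author": {"map": by_author},
            "by_keyword": {"map": by_keyword},
        },
    }


def as_json(document):
    """Serialize the design document so it can be stored with a PUT."""
    return json.dumps(document, indent=2)
```

<p>Once stored, a view is queried under the database URL, at
_design/pypi/_view/by_author for the first one, with a key parameter to
filter on a given author.</p>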
<p>I also still need to polish the import mechanism, so that I can store
directly in CouchDB:</p>
<ul>
<li>The OPML files for each project</li>
<li>The upload_time in a CouchDB-friendly format (a list of ints)</li>
<li>The tags as lists (currently it's only a string separated by spaces)</li>
</ul>
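<p>The last two conversions are small. Here is one possible sketch; the
function names are made up, and the first function accepts any object with
the usual date and time attributes, such as a datetime.datetime.</p>

```python
def upload_time_as_list(moment):
    """Turn a date and time into [year, month, day, hour, minute, second]."""
    return [moment.year, moment.month, moment.day,
            moment.hour, moment.minute, moment.second]


def keywords_as_list(keywords):
    """Split a space-separated keyword string into a list of tags."""
    # A missing or empty value gives an empty list rather than an error.
    return (keywords or "").split()
```

<p>A list of ints sorts element by element when used as a view key, which
makes it convenient to query by year, month or day.</p>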
<p>The work I've done so far is available at
<a href="https://bitbucket.org/ametaireau/pypioncouch/">https://bitbucket.org/ametaireau/pypioncouch/</a>. Keep in mind that it's
still a work in progress, and everything can break at any time. However,
any feedback will be appreciated!</p>
</div>
</div>
</div>
<label for="sidebar-checkbox" class="sidebar-toggle"></label>
<script>
(function(document) {
var i = 0;
// snip empty header rows since markdown can't
var rows = document.querySelectorAll('tr');
for(i=0; i<rows.length; i++) {
var ths = rows[i].querySelectorAll('th');
var rowlen = rows[i].children.length;
if (ths.length > 0 && ths.length === rowlen) {
rows[i].remove();
}
}
})(document);
</script>
<script>
/* Lanyon & Poole are Copyright (c) 2014 Mark Otto. Adapted to Pelican 20141223 and extended a bit by @thomaswilley */
(function(document) {
var toggle = document.querySelector('.sidebar-toggle');
var sidebar = document.querySelector('#sidebar');
var checkbox = document.querySelector('#sidebar-checkbox');
document.addEventListener('click', function(e) {
var target = e.target;
if(!checkbox.checked ||
sidebar.contains(target) ||
(target === checkbox || target === toggle)) return;
checkbox.checked = false;
}, false);
})(document);
</script>
<!-- Piwik -->
<script type="text/javascript">
var _paq = _paq || [];
_paq.push(['trackPageView']);
_paq.push(['enableLinkTracking']);
(function() {
var u="//tracker.notmyidea.org/";
_paq.push(['setTrackerUrl', u+'piwik.php']);
_paq.push(['setSiteId', 3]);
var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
g.type='text/javascript'; g.async=true; g.defer=true; g.src=u+'piwik.js'; s.parentNode.insertBefore(g,s);
})();
</script>
<noscript><p><img src="//tracker.notmyidea.org/piwik.php?idsite=3" style="border:0;" alt="" /></p></noscript>
<!-- End Piwik Code -->
</div>
</body>
</html>