<!DOCTYPE html>
<html lang="en">
<head>
<title>Using JPype to bridge python and Java</title>
<meta charset="utf-8" />
<link rel="stylesheet" href="./theme/css/main.css" type="text/css" />
<link href="./feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Alexis' log ATOM Feed" />
<!--[if IE]>
<script src="http://html5shiv.googlecode.com/svn/trunk/html5.js"></script><![endif]-->
<!--[if lte IE 7]>
<link rel="stylesheet" type="text/css" media="all" href="./css/ie.css"/>
<script src="./js/IE8.js" type="text/javascript"></script><![endif]-->
<!--[if lt IE 7]>
<link rel="stylesheet" type="text/css" media="all" href="./css/ie6.css"/><![endif]-->
</head>
<body id="index" class="home">
<a href="http://github.com/ametaireau/">
<img style="position: absolute; top: 0; right: 0; border: 0;" src="http://s3.amazonaws.com/github/ribbons/forkme_right_red_aa0000.png" alt="Fork me on GitHub" />
</a>
<header id="banner" class="body">
<h1><a href=".">Alexis' log </a></h1>
<nav><ul>
<li><a href="./pages/projects.html">projects</a></li>
<li ><a href="./category/asso.html">asso</a></li>
<li class="active"><a href="./category/dev.html">dev</a></li>
<li ><a href="./category/python.html">python</a></li>
<li ><a href="./category/system.html">system</a></li>
<li ><a href="./category/thoughts.html">thoughts</a></li>
</ul></nav>
</header><!-- /#banner -->
<section id="content" class="body">
<article>
<header> <h1 class="entry-title"><a href=""
rel="bookmark" title="Permalink to Using JPype to bridge python and Java">Using JPype to bridge python and Java</a></h1> </header>
<div class="entry-content">
<footer class="post-info">
<abbr class="published" title="2011-06-11T00:00:00">
Sat 11 June 2011
</abbr>
<address class="vcard author">
By <a class="url fn" href="./author/Alexis Métaireau.html">Alexis Métaireau</a>
</address>
<p>In <a href="./category/dev.html">dev</a>. </p>
<p>tags: <a href="./tag/python.html">python</a>, <a href="./tag/java.html">java</a></p>
</footer><!-- /.post-info -->
<p>Java provides some interesting libraries that have no exact equivalent in
Python. In my case, the awesome boilerpipe library lets me remove the
uninteresting parts of HTML pages, like menus, footers and other &quot;boilerplate&quot;
content.</p>
<p>Boilerpipe is written in Java, which leaves two options: use Java from Python,
or reimplement boilerpipe in Python. I will let you guess which one I chose, meh.</p>
<p>JPype bridges Python projects with Java libraries. It takes a different
approach from Jython: rather than reimplementing Python in Java, it makes both
languages interface at the VM level. This means you need to start a JVM
from your Python script, but it does the job and stays fully compatible with
CPython and its C extensions.</p>
<div class="section" id="first-steps-with-jpype">
<h2>First steps with JPype</h2>
<p>Once JPype is installed (you'll have to hack a few files to make it integrate
seamlessly with your system), you can access Java classes by doing something
like this:</p>
<div class="highlight"><pre><span class="kn">import</span> <span class="nn">jpype</span>
<span class="n">jpype</span><span class="o">.</span><span class="n">startJVM</span><span class="p">(</span><span class="n">jpype</span><span class="o">.</span><span class="n">getDefaultJVMPath</span><span class="p">())</span>
<span class="c"># you can then access the basic Java classes</span>
<span class="n">jpype</span><span class="o">.</span><span class="n">java</span><span class="o">.</span><span class="n">lang</span><span class="o">.</span><span class="n">System</span><span class="o">.</span><span class="n">out</span><span class="o">.</span><span class="n">println</span><span class="p">(</span><span class="s">&quot;hello world&quot;</span><span class="p">)</span>
<span class="c"># and you have to shut down the VM at the end</span>
<span class="n">jpype</span><span class="o">.</span><span class="n">shutdownJVM</span><span class="p">()</span>
</pre></div>
<p>Okay, now we have a hello world, but what we want is somewhat more complex.
We want to interact with the Java classes of a library, so we will have to load them.</p>
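<p>The start/shutdown pair shown in the hello world can also be wrapped so that the shutdown is never forgotten, even when an exception is raised in between. Here is a minimal sketch of that idea, not something JPype itself provides: running_jvm is a hypothetical helper name, and it receives the jpype module as a parameter (an assumption of this sketch) so that it can be tried with a stand-in object on a machine without Java. It only relies on the three JPype calls already used in the hello world.</p>
<pre class="literal-block">
```python
import contextlib


@contextlib.contextmanager
def running_jvm(jpype_module, *jvm_options):
    """Start the JVM, hand control to the caller, then always shut it down.

    jpype_module is expected to be the imported jpype module; passing it in
    as a parameter is a choice made for this sketch, not a JPype convention.
    """
    jpype_module.startJVM(jpype_module.getDefaultJVMPath(), *jvm_options)
    try:
        yield jpype_module
    finally:
        # Runs even if the body raised, so the VM is not left dangling.
        jpype_module.shutdownJVM()


# Intended usage, with JPype installed:
#
#     import jpype
#     with running_jvm(jpype) as jvm:
#         jvm.java.lang.System.out.println("hello world")
```
</pre>
<p>As far as I know, JPype cannot restart a JVM in the same process once it has been shut down, so a script would use a single block like this around all of its Java calls.</p>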
</div>
<div class="section" id="interfacing-with-boilerpipe">
<h2>Interfacing with Boilerpipe</h2>
<p>To build boilerpipe, you just have to run its ant script:</p>
<pre class="literal-block">
$ cd boilerpipe
$ ant
</pre>
<p>Here is a simple example of how to use boilerpipe in Java, taken from its sources:</p>
<div class="highlight"><pre><span class="kn">package</span> <span class="n">de</span><span class="o">.</span><span class="na">l3s</span><span class="o">.</span><span class="na">boilerpipe</span><span class="o">.</span><span class="na">demo</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.net.URL</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">de.l3s.boilerpipe.extractors.ArticleExtractor</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">Oneliner</span> <span class="o">{</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="kd">final</span> <span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="kd">throws</span> <span class="n">Exception</span> <span class="o">{</span>
<span class="kd">final</span> <span class="n">URL</span> <span class="n">url</span> <span class="o">=</span> <span class="k">new</span> <span class="n">URL</span><span class="o">(</span><span class="s">&quot;http://notmyidea.org&quot;</span><span class="o">);</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">ArticleExtractor</span><span class="o">.</span><span class="na">INSTANCE</span><span class="o">.</span><span class="na">getText</span><span class="o">(</span><span class="n">url</span><span class="o">));</span>
<span class="o">}</span>
<span class="o">}</span>
</pre></div>
<p>To run it:</p>
<div class="highlight"><pre><span class="nv">$ </span>javac -cp dist/boilerpipe-1.1-dev.jar:lib/nekohtml-1.9.13.jar:lib/xerces-2.9.1.jar src/demo/de/l3s/boilerpipe/demo/Oneliner.java
<span class="nv">$ </span>java -cp src/demo:dist/boilerpipe-1.1-dev.jar:lib/nekohtml-1.9.13.jar:lib/xerces-2.9.1.jar de.l3s.boilerpipe.demo.Oneliner
</pre></div>
<p>Yes, this is kind of ugly, sorry for your eyes.
Let's try something similar, but from Python:</p>
<div class="highlight"><pre><span class="kn">import</span> <span class="nn">jpype</span>
<span class="c"># start the JVM with the right classpath</span>
<span class="n">classpath</span> <span class="o">=</span> <span class="s">&quot;dist/boilerpipe-1.1-dev.jar:lib/nekohtml-1.9.13.jar:lib/xerces-2.9.1.jar&quot;</span>
<span class="n">jpype</span><span class="o">.</span><span class="n">startJVM</span><span class="p">(</span><span class="n">jpype</span><span class="o">.</span><span class="n">getDefaultJVMPath</span><span class="p">(),</span> <span class="s">&quot;-Djava.class.path=</span><span class="si">%s</span><span class="s">&quot;</span> <span class="o">%</span> <span class="n">classpath</span><span class="p">)</span>
<span class="c"># get the Java classes we want to use</span>
<span class="n">DefaultExtractor</span> <span class="o">=</span> <span class="n">jpype</span><span class="o">.</span><span class="n">JPackage</span><span class="p">(</span><span class="s">&quot;de&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">l3s</span><span class="o">.</span><span class="n">boilerpipe</span><span class="o">.</span><span class="n">extractors</span><span class="o">.</span><span class="n">DefaultExtractor</span>
<span class="c"># call them!</span>
<span class="k">print</span><span class="p">(</span><span class="n">DefaultExtractor</span><span class="o">.</span><span class="n">INSTANCE</span><span class="o">.</span><span class="n">getText</span><span class="p">(</span><span class="n">jpype</span><span class="o">.</span><span class="n">java</span><span class="o">.</span><span class="n">net</span><span class="o">.</span><span class="n">URL</span><span class="p">(</span><span class="s">&quot;http://blog.notmyidea.org&quot;</span><span class="p">)))</span>
</pre></div>
<p>And you get what you want.</p>
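<p>One detail of the snippet above: the classpath string uses ":" as a separator, which only works on Unix-like systems, and a typo in a jar path only shows up later as a confusing error. Here is a small sketch of how the option could be built instead; classpath_option is a hypothetical helper name of mine, not part of JPype, and it only uses the Python standard library.</p>
<pre class="literal-block">
```python
import os


def classpath_option(jar_paths):
    """Build the -Djava.class.path=... option for a list of jar files.

    Raises IOError when a jar is missing, so that a wrong path is reported
    before the JVM is started rather than when a class fails to load.
    """
    missing = [path for path in jar_paths if not os.path.isfile(path)]
    if missing:
        raise IOError("jar file(s) not found: %s" % ", ".join(missing))
    # os.pathsep is ":" on Unix-like systems and ";" on Windows, which is
    # the separator the JVM expects in a classpath on each platform.
    return "-Djava.class.path=%s" % os.pathsep.join(jar_paths)


# Intended usage, with the jars from the example above:
#
#     jars = ["dist/boilerpipe-1.1-dev.jar",
#             "lib/nekohtml-1.9.13.jar",
#             "lib/xerces-2.9.1.jar"]
#     jpype.startJVM(jpype.getDefaultJVMPath(), classpath_option(jars))
```
</pre>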
<p>I must say I didn't think it would work so easily. This will let me
extract the text content from URLs and remove the <em>boilerplate</em> text easily
for infuse (my master's thesis project), without having to write Java code, nice!</p>
</div>
</div><!-- /.entry-content -->
<div class="comments">
<h2>Comments!</h2>
<div id="disqus_thread"></div>
<script type="text/javascript">
var disqus_identifier = "using-jpype-to-bridge-python-and-java.html";
(function() {
var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
dsq.src = 'http://blog-notmyidea.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
</div>
</article>
</section>
<section id="extras" class="body">
<div class="blogroll">
<h2>blogroll</h2>
<ul>
<li><a href="http://biologeek.org">Biologeek</a></li>
<li><a href="http://filyb.info/">Filyb</a></li>
<li><a href="http://www.libert-fr.com">Libert-fr</a></li>
<li><a href="http://prendreuncafe.com/blog/">N1k0</a></li>
<li><a href="http://ziade.org/blog">Tarek Ziadé</a></li>
<li><a href="http://zubin71.wordpress.com/">Zubin Mithra</a></li>
</ul>
</div><!-- /.blogroll -->
<div class="social">
<h2>social</h2>
<ul>
<li><a href="./feeds/all.atom.xml" rel="alternate">atom feed</a></li>
<li><a href="http://twitter.com/ametaireau">twitter</a></li>
<li><a href="http://lastfm.com/user/akounet">lastfm</a></li>
<li><a href="http://github.com/ametaireau">github</a></li>
</ul>
</div><!-- /.social -->
</section><!-- /#extras -->
<footer id="contentinfo" class="body">
<address id="about" class="vcard body">
Proudly powered by <a href="http://alexis.notmyidea.org/pelican/">pelican</a>, which takes great advantage of <a href="http://python.org">python</a>.
</address><!-- /#about -->
<p>The theme is by <a href="http://coding.smashingmagazine.com/2009/08/04/designing-a-html-5-layout-from-scratch/">Smashing Magazine</a>, thanks!</p>
</footer><!-- /#contentinfo -->
<script type="text/javascript">
var disqus_shortname = 'blog-notmyidea';
(function () {
var s = document.createElement('script'); s.async = true;
s.type = 'text/javascript';
s.src = 'http://' + disqus_shortname + '.disqus.com/count.js';
(document.getElementsByTagName('HEAD')[0] || document.getElementsByTagName('BODY')[0]).appendChild(s);
}());
</script>
</body>
</html>