<!DOCTYPE html>
<html lang="en">
<head>
<title>Using JPype to bridge Python and Java - Alexis Métaireau</title>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/main.css" type="text/css" />
<link href="https://blog.notmyidea.org/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Alexis Métaireau ATOM Feed" />
</head>
<body>
<section id="links">
<li>
<a class="" href="https://blog.notmyidea.org/" id="site-title">Blog</a>
</li>
<li><a class="" href="https://blog.notmyidea.org/pages/projets.html">Projects</a></li>
</section>
<header>
<h1 class="post-title">Using JPype to bridge Python and Java</h1>
<time datetime="2011-06-11T00:00:00+02:00">11 June 2011</time>
</header>
<article>
<p>Java provides some interesting libraries that have no exact equivalent
in Python. In my case, the awesome boilerpipe library lets me
strip the uninteresting parts of HTML pages, such as menus, footers and other
"boilerplate" content.</p>
<p>Boilerpipe is written in Java, which leaves two options: use Java from
Python, or reimplement boilerpipe in Python. I will let you guess which
one I chose, meh.</p>
<p>JPype bridges Python projects with Java libraries. It takes
a different approach from Jython: rather than reimplementing Python in
Java, it makes both languages interface at the VM level. This means you
need to start a JVM from your Python script, but it does the job and stays
fully compatible with CPython and its C extensions.</p>
<h2 id="first-steps-with-jpype">First steps with JPype</h2>
<p>Once JPype is installed (you may have to hack a few files to make it integrate
seamlessly with your system), you can access Java classes by doing
something like this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">jpype</span>
<span class="n">jpype</span><span class="o">.</span><span class="n">startJVM</span><span class="p">(</span><span class="n">jpype</span><span class="o">.</span><span class="n">getDefaultJVMPath</span><span class="p">())</span>
<span class="c1"># you can then access the standard Java classes</span>
<span class="n">jpype</span><span class="o">.</span><span class="n">java</span><span class="o">.</span><span class="n">lang</span><span class="o">.</span><span class="n">System</span><span class="o">.</span><span class="n">out</span><span class="o">.</span><span class="n">println</span><span class="p">(</span><span class="s2">&quot;hello world&quot;</span><span class="p">)</span>
<span class="c1"># and you have to shut down the VM at the end</span>
<span class="n">jpype</span><span class="o">.</span><span class="n">shutdownJVM</span><span class="p">()</span>
</code></pre></div>
<p>Okay, now we have a hello world, but what we want is a bit more
complex. We want to interact with the boilerpipe classes, so we first have to
make the JVM load them.</p>
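<p>The start/shutdown pair from the hello world can be wrapped so that the
shutdown is never forgotten. What follows is only a minimal sketch: the
<code>running_jvm</code> helper is a hypothetical name, not part of JPype, and
it takes the imported jpype module as an argument. It relies on
<code>jpype.isJVMStarted()</code> to avoid starting the JVM twice.</p>
<div class="highlight"><pre><span></span><code>from contextlib import contextmanager


@contextmanager
def running_jvm(jvm, *options):
    """Start a JVM on entry and shut it down on exit.

    `jvm` is the imported jpype module. This helper is hypothetical,
    it is not provided by JPype itself.
    """
    # only start the JVM if it is not already running
    if not jvm.isJVMStarted():
        jvm.startJVM(jvm.getDefaultJVMPath(), *options)
    try:
        yield jvm
    finally:
        # always shut the VM down, even if the body raised
        jvm.shutdownJVM()


# Usage, assuming JPype is installed:
#
#     import jpype
#     with running_jvm(jpype) as jvm:
#         jvm.java.lang.System.out.println("hello world")
</code></pre></div>
<p>As far as I know, JPype cannot start a new JVM in the same process once
one has been shut down, so a wrapper like this should enclose the whole
script rather than individual calls.</p>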
<h2 id="interfacing-with-boilerpipe">Interfacing with Boilerpipe</h2>
<p>To build boilerpipe, you just have to run its ant script:</p>
<div class="highlight"><pre><span></span><code>$ <span class="nb">cd</span> boilerpipe
$ ant
</code></pre></div>
<p>Here is a simple example of how to use boilerpipe in Java, taken from its
sources:</p>
<div class="highlight"><pre><span></span><code><span class="kn">package</span> <span class="nn">de.l3s.boilerpipe.demo</span><span class="p">;</span>
<span class="kn">import</span> <span class="nn">java.net.URL</span><span class="p">;</span>
<span class="kn">import</span> <span class="nn">de.l3s.boilerpipe.extractors.ArticleExtractor</span><span class="p">;</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">Oneliner</span> <span class="p">{</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="p">(</span><span class="kd">final</span> <span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="p">)</span> <span class="kd">throws</span> <span class="n">Exception</span> <span class="p">{</span>
<span class="kd">final</span> <span class="n">URL</span> <span class="n">url</span> <span class="o">=</span> <span class="k">new</span> <span class="n">URL</span><span class="p">(</span><span class="s">&quot;http://notmyidea.org&quot;</span><span class="p">);</span>
<span class="n">System</span><span class="p">.</span><span class="na">out</span><span class="p">.</span><span class="na">println</span><span class="p">(</span><span class="n">ArticleExtractor</span><span class="p">.</span><span class="na">INSTANCE</span><span class="p">.</span><span class="na">getText</span><span class="p">(</span><span class="n">url</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
<p>To run it:</p>
<div class="highlight"><pre><span></span><code>$ javac -cp dist/boilerpipe-1.1-dev.jar:lib/nekohtml-1.9.13.jar:lib/xerces-2.9.1.jar src/demo/de/l3s/boilerpipe/demo/Oneliner.java
$ java -cp src/demo:dist/boilerpipe-1.1-dev.jar:lib/nekohtml-1.9.13.jar:lib/xerces-2.9.1.jar de.l3s.boilerpipe.demo.Oneliner
</code></pre></div>
<p>Yes, this is kind of ugly, sorry for your eyes. Let's try something
similar, but from Python:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">jpype</span>
<span class="c1"># start the JVM with the right classpath</span>
<span class="n">classpath</span> <span class="o">=</span> <span class="s2">&quot;dist/boilerpipe-1.1-dev.jar:lib/nekohtml-1.9.13.jar:lib/xerces-2.9.1.jar&quot;</span>
<span class="n">jpype</span><span class="o">.</span><span class="n">startJVM</span><span class="p">(</span><span class="n">jpype</span><span class="o">.</span><span class="n">getDefaultJVMPath</span><span class="p">(),</span> <span class="s2">&quot;-Djava.class.path=</span><span class="si">%s</span><span class="s2">&quot;</span> <span class="o">%</span> <span class="n">classpath</span><span class="p">)</span>
<span class="c1"># get the Java classes we want to use</span>
<span class="n">DefaultExtractor</span> <span class="o">=</span> <span class="n">jpype</span><span class="o">.</span><span class="n">JPackage</span><span class="p">(</span><span class="s2">&quot;de&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">l3s</span><span class="o">.</span><span class="n">boilerpipe</span><span class="o">.</span><span class="n">extractors</span><span class="o">.</span><span class="n">DefaultExtractor</span>
<span class="c1"># call them!</span>
<span class="nb">print</span><span class="p">(</span><span class="n">DefaultExtractor</span><span class="o">.</span><span class="n">INSTANCE</span><span class="o">.</span><span class="n">getText</span><span class="p">(</span><span class="n">jpype</span><span class="o">.</span><span class="n">java</span><span class="o">.</span><span class="n">net</span><span class="o">.</span><span class="n">URL</span><span class="p">(</span><span class="s2">&quot;http://blog.notmyidea.org&quot;</span><span class="p">)))</span>
</code></pre></div>
<p>And you get what you want.</p>
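<p>One caveat about the classpath string: it hardcodes <code>:</code> as the
separator, which is what the JVM expects on Unix, while on Windows it expects
<code>;</code>. A small sketch, assuming a hypothetical helper named
<code>classpath_option</code>, builds the same option in a portable way with
<code>os.pathsep</code>:</p>
<div class="highlight"><pre><span></span><code>import os


def classpath_option(jars):
    """Build the -Djava.class.path JVM option from a list of jar paths.

    os.pathsep is ":" on Unix and ";" on Windows, which matches the
    classpath separator the JVM expects on each platform.
    This helper is hypothetical, it is not provided by JPype.
    """
    return "-Djava.class.path=%s" % os.pathsep.join(jars)


# the same jars as in the example above
jars = [
    "dist/boilerpipe-1.1-dev.jar",
    "lib/nekohtml-1.9.13.jar",
    "lib/xerces-2.9.1.jar",
]
option = classpath_option(jars)
# `option` can then be passed to jpype.startJVM after the JVM path
</code></pre></div>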
<p>I must say I didn't think it would work so easily. This will let me
extract text content from URLs and remove the <em>boilerplate</em> text
easily for infuse (my master's thesis project), without having to write
Java code, nice!</p>
</article>
</body>
</html>