mirror of
https://github.com/almet/notmyidea.git
synced 2025-04-28 19:42:37 +02:00
105 lines
No EOL
7.6 KiB
HTML
<!DOCTYPE html>
<html lang="en">
<head>
<title>Using JPype to bridge python and Java - Alexis Métaireau</title>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/main.css" type="text/css" />
<link href="https://blog.notmyidea.org/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Alexis Métaireau ATOM Feed" />
</head>
<body>
<section id="links">
<li>
<a class="" href="https://blog.notmyidea.org/" id="site-title">Blog</a>
</li>
<li><a class="" href="https://blog.notmyidea.org/pages/projets.html">Projects</a></li>
</section>
<header>
<h1 class="post-title">Using JPype to bridge python and Java</h1>
<time datetime="2011-06-11T00:00:00+02:00">11 June 2011</time>
</header>
<article>
<p>Java provides some interesting libraries that have no exact equivalent
in Python. In my case, the awesome boilerpipe library lets me
remove the uninteresting parts of <span class="caps">HTML</span> pages, like menus, footers and other
“boilerplate” content.</p>
<p>Boilerpipe is written in Java. Two options, then: use Java from
Python, or reimplement boilerpipe in Python. I will let you guess which
one I chose, meh.</p>
<p>JPype bridges Python projects with Java libraries. It takes
a different approach from Jython: rather than reimplementing Python in
Java, it makes both languages interface at the <span class="caps">VM</span> level. This means you
need to start a <span class="caps">VM</span> from your Python script, but it does the job and stays
fully compatible with CPython and its C extensions.</p>
<h2 id="first-steps-with-jpype">First steps with JPype</h2>
<p>Once JPype is installed (you’ll have to hack a few files to make it integrate
seamlessly with your system), you can access Java classes by doing
something like this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">jpype</span>
<span class="n">jpype</span><span class="o">.</span><span class="n">startJVM</span><span class="p">(</span><span class="n">jpype</span><span class="o">.</span><span class="n">getDefaultJVMPath</span><span class="p">())</span>

<span class="c1"># you can then access the basic Java classes</span>
<span class="n">jpype</span><span class="o">.</span><span class="n">java</span><span class="o">.</span><span class="n">lang</span><span class="o">.</span><span class="n">System</span><span class="o">.</span><span class="n">out</span><span class="o">.</span><span class="n">println</span><span class="p">(</span><span class="s2">"hello world"</span><span class="p">)</span>

<span class="c1"># and you have to shut down the VM at the end</span>
<span class="n">jpype</span><span class="o">.</span><span class="n">shutdownJVM</span><span class="p">()</span>
</code></pre></div>
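<p>Since the VM has to be started before use and shut down afterwards, one way to keep that bookkeeping in one place is a context manager. This is only a sketch: <code>running_jvm</code> is a hypothetical helper, not part of JPype, and it assumes the <code>jpype</code> module exposes <code>isJVMStarted()</code> in addition to the calls used above. The module is passed in as an argument so the helper can be exercised without a real JVM.</p>

```python
import contextlib


@contextlib.contextmanager
def running_jvm(jpype_module, *jvm_args):
    """Start the JVM if needed, and shut it down only if we started it.

    `jpype_module` is expected to be the imported `jpype` module
    (hypothetical helper: this function is not provided by JPype itself).
    """
    started_here = False
    if not jpype_module.isJVMStarted():
        jpype_module.startJVM(jpype_module.getDefaultJVMPath(), *jvm_args)
        started_here = True
    try:
        yield jpype_module
    finally:
        # Leave a JVM that someone else started untouched.
        if started_here:
            jpype_module.shutdownJVM()
```

<p>With it, the hello world above could be written as a <code>with running_jvm(jpype) as jp:</code> block that calls <code>jp.java.lang.System.out.println("hello world")</code>. JPype generally cannot restart a JVM once it has been shut down in the same process, so the block should wrap the whole Java-using part of the script.</p>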
<p>Okay, now we have a hello world, but what we want is somewhat more
complex. We want to interact with Java classes, so we will have to load them.</p>
<h2 id="interfacing-with-boilerpipe">Interfacing with Boilerpipe</h2>
<p>To build boilerpipe, you just have to run its Ant script:</p>
<div class="highlight"><pre><span></span><code>$ <span class="nb">cd</span> boilerpipe
$ ant
</code></pre></div>
<p>Here is a simple example of how to use boilerpipe in Java, taken from its sources:</p>
<div class="highlight"><pre><span></span><code><span class="kn">package</span> <span class="nn">de.l3s.boilerpipe.demo</span><span class="p">;</span>
<span class="kn">import</span> <span class="nn">java.net.URL</span><span class="p">;</span>
<span class="kn">import</span> <span class="nn">de.l3s.boilerpipe.extractors.ArticleExtractor</span><span class="p">;</span>

<span class="kd">public</span> <span class="kd">class</span> <span class="nc">Oneliner</span> <span class="p">{</span>
    <span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="p">(</span><span class="kd">final</span> <span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="p">)</span> <span class="kd">throws</span> <span class="n">Exception</span> <span class="p">{</span>
        <span class="kd">final</span> <span class="n">URL</span> <span class="n">url</span> <span class="o">=</span> <span class="k">new</span> <span class="n">URL</span><span class="p">(</span><span class="s">"http://notmyidea.org"</span><span class="p">);</span>
        <span class="n">System</span><span class="p">.</span><span class="na">out</span><span class="p">.</span><span class="na">println</span><span class="p">(</span><span class="n">ArticleExtractor</span><span class="p">.</span><span class="na">INSTANCE</span><span class="p">.</span><span class="na">getText</span><span class="p">(</span><span class="n">url</span><span class="p">));</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
<p>To compile and run it:</p>
<div class="highlight"><pre><span></span><code>$ javac -cp dist/boilerpipe-1.1-dev.jar:lib/nekohtml-1.9.13.jar:lib/xerces-2.9.1.jar src/demo/de/l3s/boilerpipe/demo/Oneliner.java
$ java -cp src/demo:dist/boilerpipe-1.1-dev.jar:lib/nekohtml-1.9.13.jar:lib/xerces-2.9.1.jar de.l3s.boilerpipe.demo.Oneliner
</code></pre></div>
<p>Yes, this is kind of ugly, sorry for your eyes. Let’s try something
similar, but from Python:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">jpype</span>

<span class="c1"># start the JVM with the right classpath</span>
<span class="n">classpath</span> <span class="o">=</span> <span class="s2">"dist/boilerpipe-1.1-dev.jar:lib/nekohtml-1.9.13.jar:lib/xerces-2.9.1.jar"</span>
<span class="n">jpype</span><span class="o">.</span><span class="n">startJVM</span><span class="p">(</span><span class="n">jpype</span><span class="o">.</span><span class="n">getDefaultJVMPath</span><span class="p">(),</span> <span class="s2">"-Djava.class.path=</span><span class="si">%s</span><span class="s2">"</span> <span class="o">%</span> <span class="n">classpath</span><span class="p">)</span>

<span class="c1"># get the Java classes we want to use</span>
<span class="n">DefaultExtractor</span> <span class="o">=</span> <span class="n">jpype</span><span class="o">.</span><span class="n">JPackage</span><span class="p">(</span><span class="s2">"de"</span><span class="p">)</span><span class="o">.</span><span class="n">l3s</span><span class="o">.</span><span class="n">boilerpipe</span><span class="o">.</span><span class="n">extractors</span><span class="o">.</span><span class="n">DefaultExtractor</span>

<span class="c1"># call them!</span>
<span class="nb">print</span><span class="p">(</span><span class="n">DefaultExtractor</span><span class="o">.</span><span class="n">INSTANCE</span><span class="o">.</span><span class="n">getText</span><span class="p">(</span><span class="n">jpype</span><span class="o">.</span><span class="n">java</span><span class="o">.</span><span class="n">net</span><span class="o">.</span><span class="n">URL</span><span class="p">(</span><span class="s2">"http://blog.notmyidea.org"</span><span class="p">)))</span>
</code></pre></div>
<p>And you get what you want.</p>
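<p>The classpath string used from Python joins the jars with <code>:</code>, which is the separator on Unix-like systems only (Windows uses <code>;</code>). A minimal sketch of building the same option portably, using nothing but the standard library; <code>build_classpath</code> and <code>classpath_option</code> are hypothetical helper names, and the jar names are simply the ones from the example:</p>

```python
import os


def build_classpath(jars):
    """Join jar paths with the platform's path separator (os.pathsep)."""
    return os.pathsep.join(jars)


def classpath_option(jars):
    """Build the -Djava.class.path=... option that gets passed to startJVM."""
    return "-Djava.class.path=%s" % build_classpath(jars)


# Jar names taken from the boilerpipe example; adjust to your build.
JARS = [
    "dist/boilerpipe-1.1-dev.jar",
    "lib/nekohtml-1.9.13.jar",
    "lib/xerces-2.9.1.jar",
]

print(classpath_option(JARS))
```

<p>The resulting string can be passed as the second argument to <code>jpype.startJVM</code>, in place of the hand-written one.</p>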
<p>I must say I didn’t think it would work so easily. This will allow me
to extract text content from URLs and remove the <em>boilerplate</em> text
easily for infuse (my master’s thesis project), without having to write
Java code, nice!</p>
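<p>For reuse in a larger project, the calls from the Python example can be folded into one function. This is a sketch under assumptions: <code>extract_text</code> is a hypothetical name, the JVM is assumed to be already running with boilerpipe on its classpath, and the <code>jpype</code> module is passed in as an argument so the function can be tried with a stand-in object.</p>

```python
def extract_text(jpype_module, url, extractor_name="DefaultExtractor"):
    """Return the boilerplate-free text of the page at `url`.

    Hypothetical helper: it only chains the JPackage and java.net.URL
    calls shown in the example, and assumes the JVM is already started.
    """
    extractors = jpype_module.JPackage("de").l3s.boilerpipe.extractors
    extractor = getattr(extractors, extractor_name)
    java_url = jpype_module.java.net.URL(url)
    # str() gives a plain Python string whatever string type JPype returns.
    return str(extractor.INSTANCE.getText(java_url))
```

<p>Passing <code>extractor_name="ArticleExtractor"</code> would select the extractor used in the Java demo instead.</p>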
</article>
</body>
</html>