mirror of
https://github.com/almet/notmyidea.git
synced 2025-04-28 19:42:37 +02:00
<!DOCTYPE HTML>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<link rel="stylesheet" href="./theme/css/main.css" type="text/css" media="screen" charset="utf-8">
<link href="./feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Alexis' log ATOM Feed" />
<title>Alexis Métaireau</title>
</head>
<body>
<div id="top">
<p class="author"><a href=".">Alexis Métaireau</a>'s thoughts</p>
<ul class="links"></ul>
</div>
<div class="content clear">
<h1>Using JPype to bridge Python and Java</h1>
<p class="date">Published on Sat 11 June 2011</p>
<p>Java provides some interesting libraries that have no exact equivalent in
Python. In my case, the awesome boilerpipe library allows me to remove
uninteresting parts of HTML pages, like menus, footers and other "boilerplate"
content.</p>
<p>Boilerpipe is written in Java. That leaves two options: use Java from Python, or
reimplement boilerpipe in Python. I will let you guess which one I chose, meh.</p>
<p>JPype lets you bridge a Python project with Java libraries. It takes a different
approach from Jython: rather than reimplementing Python in Java, it makes the two
languages interface at the VM level. This means you need to start a JVM
from your Python script, but it does the job and stays fully compatible with
CPython and its C extensions.</p>
<div class="section" id="first-steps-with-jpype">
<h2>First steps with JPype</h2>
<p>Once JPype is installed (you'll have to hack a few files a bit to make it integrate
seamlessly with your system), you can access Java classes by doing something
like this:</p>
<div class="highlight"><pre><span class="kn">import</span> <span class="nn">jpype</span>
<span class="n">jpype</span><span class="o">.</span><span class="n">startJVM</span><span class="p">(</span><span class="n">jpype</span><span class="o">.</span><span class="n">getDefaultJVMPath</span><span class="p">())</span>

<span class="c"># you can then access the basic Java classes</span>
<span class="n">jpype</span><span class="o">.</span><span class="n">java</span><span class="o">.</span><span class="n">lang</span><span class="o">.</span><span class="n">System</span><span class="o">.</span><span class="n">out</span><span class="o">.</span><span class="n">println</span><span class="p">(</span><span class="s">"hello world"</span><span class="p">)</span>

<span class="c"># and you have to shut down the VM at the end</span>
<span class="n">jpype</span><span class="o">.</span><span class="n">shutdownJVM</span><span class="p">()</span>
</pre></div>
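<p>The start/shutdown pair above can also be wrapped so that the shutdown always
happens, even when a Java call raises. This is only a sketch under assumptions:
<code>running_jvm</code> is a hypothetical helper name (it is not part of JPype), and it
assumes <code>jpype.isJVMStarted()</code> is available in your JPype version. JPype does
not support restarting a JVM after shutdown in the same process, so such a wrapper
would be used once, around the whole Java part of a script.</p>

```python
from contextlib import contextmanager


@contextmanager
def running_jvm(*jvm_options):
    """Start the JVM, yield the jpype module, and always shut the JVM down."""
    # Imported lazily so that this file can be loaded without JPype installed.
    import jpype

    if not jpype.isJVMStarted():
        jpype.startJVM(jpype.getDefaultJVMPath(), *jvm_options)
    try:
        yield jpype
    finally:
        # Runs even if the body of the "with" block raised an exception.
        jpype.shutdownJVM()
```

<p>Usage would then look like
<code>with running_jvm() as jp: jp.java.lang.System.out.println("hello world")</code>.</p>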
<p>Okay, now we have a hello world, but what we want is somewhat more complex.
We want to interact with Java classes, so we will have to load them.</p>
</div>
<div class="section" id="interfacing-with-boilerpipe">
<h2>Interfacing with Boilerpipe</h2>
<p>To install boilerpipe, you just have to run an ant script:</p>
<pre class="literal-block">
$ cd boilerpipe
$ ant
</pre>
<p>Here is a simple example of how to use boilerpipe in Java, taken from its sources:</p>
<div class="highlight"><pre><span class="kn">package</span> <span class="n">de</span><span class="o">.</span><span class="na">l3s</span><span class="o">.</span><span class="na">boilerpipe</span><span class="o">.</span><span class="na">demo</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.net.URL</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">de.l3s.boilerpipe.extractors.ArticleExtractor</span><span class="o">;</span>

<span class="kd">public</span> <span class="kd">class</span> <span class="nc">Oneliner</span> <span class="o">{</span>
    <span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="kd">final</span> <span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="kd">throws</span> <span class="n">Exception</span> <span class="o">{</span>
        <span class="kd">final</span> <span class="n">URL</span> <span class="n">url</span> <span class="o">=</span> <span class="k">new</span> <span class="n">URL</span><span class="o">(</span><span class="s">"http://notmyidea.org"</span><span class="o">);</span>
        <span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">ArticleExtractor</span><span class="o">.</span><span class="na">INSTANCE</span><span class="o">.</span><span class="na">getText</span><span class="o">(</span><span class="n">url</span><span class="o">));</span>
    <span class="o">}</span>
<span class="o">}</span>
</pre></div>
<p>To run it:</p>
<div class="highlight"><pre><span class="nv">$ </span>javac -cp dist/boilerpipe-1.1-dev.jar:lib/nekohtml-1.9.13.jar:lib/xerces-2.9.1.jar src/demo/de/l3s/boilerpipe/demo/Oneliner.java
<span class="nv">$ </span>java -cp src/demo:dist/boilerpipe-1.1-dev.jar:lib/nekohtml-1.9.13.jar:lib/xerces-2.9.1.jar de.l3s.boilerpipe.demo.Oneliner
</pre></div>
<p>Yes, this is kind of ugly, sorry for your eyes.
Let's try something similar, but from Python:</p>
<div class="highlight"><pre><span class="kn">import</span> <span class="nn">jpype</span>

<span class="c"># start the JVM with the right classpath</span>
<span class="n">classpath</span> <span class="o">=</span> <span class="s">"dist/boilerpipe-1.1-dev.jar:lib/nekohtml-1.9.13.jar:lib/xerces-2.9.1.jar"</span>
<span class="n">jpype</span><span class="o">.</span><span class="n">startJVM</span><span class="p">(</span><span class="n">jpype</span><span class="o">.</span><span class="n">getDefaultJVMPath</span><span class="p">(),</span> <span class="s">"-Djava.class.path=</span><span class="si">%s</span><span class="s">"</span> <span class="o">%</span> <span class="n">classpath</span><span class="p">)</span>

<span class="c"># get the Java classes we want to use</span>
<span class="n">DefaultExtractor</span> <span class="o">=</span> <span class="n">jpype</span><span class="o">.</span><span class="n">JPackage</span><span class="p">(</span><span class="s">"de"</span><span class="p">)</span><span class="o">.</span><span class="n">l3s</span><span class="o">.</span><span class="n">boilerpipe</span><span class="o">.</span><span class="n">extractors</span><span class="o">.</span><span class="n">DefaultExtractor</span>

<span class="c"># call them!</span>
<span class="k">print</span><span class="p">(</span><span class="n">DefaultExtractor</span><span class="o">.</span><span class="n">INSTANCE</span><span class="o">.</span><span class="n">getText</span><span class="p">(</span><span class="n">jpype</span><span class="o">.</span><span class="n">java</span><span class="o">.</span><span class="n">net</span><span class="o">.</span><span class="n">URL</span><span class="p">(</span><span class="s">"http://blog.notmyidea.org"</span><span class="p">)))</span>
</pre></div>
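<p>The classpath string above uses <code>:</code> as its separator, which is what the JVM
expects on Linux and macOS; on Windows it expects <code>;</code>. Here is a minimal sketch
of building the same option portably with <code>os.pathsep</code>. The helper name
<code>classpath_option</code> is made up for this example; the jar paths are the ones used
above.</p>

```python
import os


def classpath_option(jar_paths):
    """Return a -Djava.class.path=... JVM option for the given jar files."""
    # os.pathsep is ":" on POSIX systems and ";" on Windows.
    return "-Djava.class.path=" + os.pathsep.join(jar_paths)


jars = [
    "dist/boilerpipe-1.1-dev.jar",
    "lib/nekohtml-1.9.13.jar",
    "lib/xerces-2.9.1.jar",
]
option = classpath_option(jars)
print(option)
```

<p>The resulting string could then be passed to <code>jpype.startJVM</code> in place of the
hand-written <code>"-Djava.class.path=%s" % classpath</code>.</p>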
<p>And you get what you want.</p>
<p>I must say I didn't think it would work so easily. This will allow me to
extract text content from URLs and remove the <em>boilerplate</em> text easily
for infuse (my master's thesis project), without having to write Java code, nice!</p>
</div>

<div class="comments">
<h2>Comments</h2>
<div id="disqus_thread"></div>
<script type="text/javascript">
var disqus_identifier = "using-jpype-to-bridge-python-and-java.html";
(function() {
var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
dsq.src = 'http://blog-notmyidea.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
</div>

</div>
</body>
</html>