blog.notmyidea.org/using-jpype-to-bridge-python-and-java.html
2019-11-07 17:39:15 +01:00


<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1">
<link rel="shortcut icon" type="image/x-icon" href="favicon.ico" />
<title>Using JPype to bridge python and Java - Alexis - Carnets en ligne</title>
<meta charset="utf-8" />
<link href="https://blog.notmyidea.org/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Alexis - Carnets en ligne Full Atom Feed" />
<link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/poole.css"/>
<link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/syntax.css"/>
<link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/lanyon.css"/>
<link rel="stylesheet" href="//fonts.googleapis.com/css?family=PT+Serif:400,400italic,700%7CPT+Sans:400">
<link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/styles.css"/>
<style>
h1 {
font-family: "Avant Garde", Avantgarde, "Century Gothic", CenturyGothic, "AppleGothic", sans-serif;
padding: 80px 50px;
text-align: center;
text-transform: uppercase;
text-rendering: optimizeLegibility;
color: #202020;
letter-spacing: .1em;
text-shadow:
-1px -1px 1px #111,
2px 2px 1px #eaeaea;
}
#main {
text-align: justify;
text-justify: inter-word;
}
#main h1 {
padding: 10px;
}
.post-headline {
padding: 15px;
}
</style>
</head>
<body>
<!-- Target for toggling the sidebar `.sidebar-checkbox` is for regular
styles, `#sidebar-checkbox` for behavior. -->
<input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox">
<!-- Toggleable sidebar -->
<div class="sidebar" id="sidebar">
<div class="sidebar-item">
<div class="profile">
<img src="https://blog.notmyidea.org/theme/img/profile.png"/>
</div>
</div>
<nav class="sidebar-nav">
<a class="sidebar-nav-item" href="/">Articles</a>
<a class="sidebar-nav-item" href="https://www.vieuxsinge.com">Brasserie du Vieux Singe</a>
<a class="sidebar-nav-item" href="http://blog.notmyidea.org/pages/about.html">About</a>
<a class="sidebar-nav-item" href="https://twitter.com/ametaireau">Short messages</a>
<a class="sidebar-nav-item" href="https://github.com/almet">Code</a>
</nav>
</div> <div class="wrap">
<div class="masthead">
<div class="container">
<h3 class="masthead-title">
<a href="https://blog.notmyidea.org/" title="Home">Alexis - Carnets en ligne</a>
</h3>
</div>
</div>
<div class="container content">
<div id="main" class="posts">
<h1 class="post-title">Using JPype to bridge python and Java</h1>
<span class="post-date">11 June 2011, in <a class="no-color" href="category/technologie.html">Technology</a></span>
<img id="illustration" src="" />
<div class="post article">
<h1>🌟</h1>
<p>Java provides some interesting libraries that have no exact equivalent
in Python. In my case, the awesome boilerpipe library lets me
remove the uninteresting parts of HTML pages, like menus, footers and other
"boilerplate" content.</p>
<p>Boilerpipe is written in Java, which leaves two options: using Java from
Python, or reimplementing boilerpipe in Python. I will let you guess which
one I chose, meh.</p>
<p>JPype lets you bridge a Python project with Java libraries. It takes
a different approach from Jython: rather than reimplementing Python in
Java, it makes the two languages interface at the VM level. This means you
need to start a JVM from your Python script, but it does the job and stays
fully compatible with CPython and its C extensions.</p>
<h2 id="first-steps-with-jpype">First steps with JPype</h2>
<p>Once JPype is installed (you'll have to hack a few files to make it integrate
seamlessly with your system), you can access Java classes by doing
something like this:</p>
<div class="highlight"><pre><span></span>import jpype
jpype.startJVM(jpype.getDefaultJVMPath())

# you can then access the basic Java functions
jpype.java.lang.System.out.println("hello world")

# and you have to shut down the VM at the end
jpype.shutdownJVM()
</pre></div>
<p>Okay, now we have a hello world, but what we want is somewhat more
complex. We want to interact with Java classes, so we will have to load
them.</p>
<h2 id="interfacing-with-boilerpipe">Interfacing with Boilerpipe</h2>
<p>To install boilerpipe, you just have to run an ant script:</p>
<div class="highlight"><pre><span></span>$ cd boilerpipe
$ ant
</pre></div>
<p>Here is a simple example of how to use boilerpipe in Java, from their
sources:</p>
<div class="highlight"><pre><span></span>package de.l3s.boilerpipe.demo;

import java.net.URL;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class Oneliner {
    public static void main(final String[] args) throws Exception {
        final URL url = new URL("http://notmyidea.org");
        System.out.println(ArticleExtractor.INSTANCE.getText(url));
    }
}
</pre></div>
<p>To run it:</p>
<div class="highlight"><pre><span></span>$ javac -cp dist/boilerpipe-1.1-dev.jar:lib/nekohtml-1.9.13.jar:lib/xerces-2.9.1.jar src/demo/de/l3s/boilerpipe/demo/Oneliner.java
$ java -cp src/demo:dist/boilerpipe-1.1-dev.jar:lib/nekohtml-1.9.13.jar:lib/xerces-2.9.1.jar de.l3s.boilerpipe.demo.Oneliner
</pre></div>
<p>Yes, this is kind of ugly, sorry for your eyes. Let's try something
similar, but from Python:</p>
<div class="highlight"><pre><span></span>import jpype

# start the JVM with the right classpath
classpath = "dist/boilerpipe-1.1-dev.jar:lib/nekohtml-1.9.13.jar:lib/xerces-2.9.1.jar"
jpype.startJVM(jpype.getDefaultJVMPath(), "-Djava.class.path=%s" % classpath)

# get the Java classes we want to use
DefaultExtractor = jpype.JPackage("de").l3s.boilerpipe.extractors.DefaultExtractor

# call them!
print(DefaultExtractor.INSTANCE.getText(jpype.java.net.URL("http://blog.notmyidea.org")))
</pre></div>
<p>And you get what you want.</p>
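<p>One detail worth noting: the classpath string uses ":" as its separator,
which is the convention on Linux and macOS, while a JVM on Windows expects ";".
Here is a minimal sketch, assuming the same jar paths as in this post, that
builds the option with <code>os.pathsep</code> instead. The name
<code>build_classpath_option</code> is a hypothetical helper of mine, not
something JPype provides.</p>

```python
import os

# Jar paths taken from this post; adjust them to your own boilerpipe checkout.
JARS = [
    "dist/boilerpipe-1.1-dev.jar",
    "lib/nekohtml-1.9.13.jar",
    "lib/xerces-2.9.1.jar",
]


def build_classpath_option(jars, sep=os.pathsep):
    """Join jar paths into a -Djava.class.path JVM option.

    Hypothetical helper, not part of JPype. By default it uses the
    platform's path separator (":" on Linux and macOS, ";" on Windows).
    """
    return "-Djava.class.path=%s" % sep.join(jars)


if __name__ == "__main__":
    # Print the option that would be handed to the JVM on this platform.
    print(build_classpath_option(JARS))
```

<p>The resulting string can then be passed as the second argument of
<code>jpype.startJVM</code>, in place of the hand-written option.</p>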
<p>I must say I didn't think it would work so easily. This will let me
extract the text content from URLs and remove the <em>boilerplate</em> text
easily for infuse (my master's thesis project), without having to write
any Java code. Nice!</p>
</div>
</div>
</div>
<label for="sidebar-checkbox" class="sidebar-toggle"></label>
<script>
(function(document) {
var i = 0;
// snip empty header rows since markdown can't
var rows = document.querySelectorAll('tr');
for(i=0; i<rows.length; i++) {
var ths = rows[i].querySelectorAll('th');
var rowlen = rows[i].children.length;
if (ths.length > 0 && ths.length === rowlen) {
rows[i].remove();
}
}
})(document);
</script>
<script>
/* Lanyon & Poole are Copyright (c) 2014 Mark Otto. Adapted to Pelican 20141223 and extended a bit by @thomaswilley */
(function(document) {
var toggle = document.querySelector('.sidebar-toggle');
var sidebar = document.querySelector('#sidebar');
var checkbox = document.querySelector('#sidebar-checkbox');
document.addEventListener('click', function(e) {
var target = e.target;
if(!checkbox.checked ||
sidebar.contains(target) ||
(target === checkbox || target === toggle)) return;
checkbox.checked = false;
}, false);
})(document);
</script>
<!-- Piwik -->
<script type="text/javascript">
var _paq = _paq || [];
_paq.push(['trackPageView']);
_paq.push(['enableLinkTracking']);
(function() {
var u="//tracker.notmyidea.org/";
_paq.push(['setTrackerUrl', u+'piwik.php']);
_paq.push(['setSiteId', 3]);
var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
g.type='text/javascript'; g.async=true; g.defer=true; g.src=u+'piwik.js'; s.parentNode.insertBefore(g,s);
})();
</script>
<noscript><p><img src="//tracker.notmyidea.org/piwik.php?idsite=3" style="border:0;" alt="" /></p></noscript>
<!-- End Piwik Code -->
</div>
</body>
</html>