mirror of
https://github.com/almet/notmyidea.git
synced 2025-04-28 19:42:37 +02:00
220 lines
No EOL
13 KiB
HTML
<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1">
<link rel="shortcut icon" type="image/x-icon" href="favicon.ico" />

<title>Using JPype to bridge python and Java - Alexis - Carnets en ligne</title>

<meta charset="utf-8" />
<link href="https://blog.notmyidea.org/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Alexis - Carnets en ligne Full Atom Feed" />
<link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/poole.css"/>
<link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/syntax.css"/>
<link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/lanyon.css"/>
<link rel="stylesheet" href="//fonts.googleapis.com/css?family=PT+Serif:400,400italic,700%7CPT+Sans:400">
<link rel="stylesheet" href="https://blog.notmyidea.org/theme/css/styles.css"/>

<style>
h1 {
  font-family: "Avant Garde", Avantgarde, "Century Gothic", CenturyGothic, "AppleGothic", sans-serif;
  padding: 80px 50px;
  text-align: center;
  text-transform: uppercase;
  text-rendering: optimizeLegibility;
  color: #202020;
  letter-spacing: .1em;
  text-shadow:
    -1px -1px 1px #111,
    2px 2px 1px #eaeaea;
}

#main {
  text-align: justify;
  text-justify: inter-word;
}
#main h1 {
  padding: 10px;
}

.post-headline {
  padding: 15px;
}
</style>
</head>

<body>
<!-- Target for toggling the sidebar `.sidebar-checkbox` is for regular
styles, `#sidebar-checkbox` for behavior. -->
<input type="checkbox" class="sidebar-checkbox" id="sidebar-checkbox">
<!-- Toggleable sidebar -->
<div class="sidebar" id="sidebar">
<div class="sidebar-item">
<div class="profile">
<img src="https://blog.notmyidea.org/theme/img/profile.png"/>
</div>
</div>

<nav class="sidebar-nav">
<a class="sidebar-nav-item" href="/">Articles</a>

<a class="sidebar-nav-item" href="https://www.vieuxsinge.com">Brasserie du Vieux Singe</a>
<a class="sidebar-nav-item" href="http://blog.notmyidea.org/pages/about.html">About</a>
<a class="sidebar-nav-item" href="https://twitter.com/ametaireau">Short messages</a>
<a class="sidebar-nav-item" href="https://github.com/almet">Code</a>
</nav>
</div> <div class="wrap">
<div class="masthead">
<div class="container">
<h3 class="masthead-title">
<a href="https://blog.notmyidea.org/" title="Home">Alexis - Carnets en ligne</a>
</h3>
</div>
</div>

<div class="container content">
<div id="main" class="posts">
<h1 class="post-title">Using JPype to bridge python and Java</h1>
<span class="post-date">11 June 2011, in <a class="no-color" href="category/technologie.html">Technologie</a></span>
<img id="illustration" src="" />

<div class="post article">
<h1>🌟</h1>

<p>Java provides some interesting libraries that have no exact equivalent
in Python. In my case, the awesome boilerpipe library lets me
remove the uninteresting parts of HTML pages, like menus, footers and other
"boilerplate" content.</p>
<p>Boilerpipe is written in Java. That leaves two options: use Java from
Python, or reimplement boilerpipe in Python. I will let you guess which
one I chose.</p>
<p>JPype makes it possible to bridge a Python project with Java libraries. It takes
a different approach from Jython: rather than reimplementing Python in
Java, it lets the two languages interface at the VM level. This means you
need to start a JVM from your Python script, but it does the job and stays
fully compatible with CPython and its C extensions.</p>
<h2 id="first-steps-with-jpype">First steps with JPype</h2>
<p>Once JPype is installed (you may have to hack a few files to make it integrate
seamlessly with your system), you can access Java classes by doing
something like this:</p>
<div class="highlight"><pre><span></span>import jpype

jpype.startJVM(jpype.getDefaultJVMPath())

# you can then access the basic Java functions
jpype.java.lang.System.out.println("hello world")

# and you have to shut down the VM at the end
jpype.shutdownJVM()
</pre></div>
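<p>Since the JVM has to be started before use and shut down afterwards, the two calls can be paired in a context manager. This is only a sketch under assumptions: <code>running_jvm</code> is a hypothetical helper, not part of JPype, and it receives the <code>jpype</code> module as an argument so that the snippet can be read and tried without a JVM at hand. It relies on <code>jpype.isJVMStarted()</code> to avoid starting the JVM twice.</p>

```python
from contextlib import contextmanager


@contextmanager
def running_jvm(jpype_module, *jvm_args):
    """Start the JVM if needed and shut it down on exit.

    Hypothetical helper: `jpype_module` is expected to be the imported
    `jpype` module; `jvm_args` are extra options passed to startJVM.
    """
    started_here = False
    if not jpype_module.isJVMStarted():
        jpype_module.startJVM(jpype_module.getDefaultJVMPath(), *jvm_args)
        started_here = True
    try:
        yield jpype_module
    finally:
        # Only shut down a JVM that this block started itself.
        if started_here:
            jpype_module.shutdownJVM()
```

<p>With JPype installed, usage would look like <code>with running_jvm(jpype) as jp: jp.java.lang.System.out.println("hello world")</code>. Keep in mind that JPype cannot restart a JVM once it has been shut down, so this fits best around a whole script rather than around individual calls.</p>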
<p>Okay, now we have a hello world, but what we want is somewhat more
complex. We want to interact with Java classes, so we will have to load
them.</p>
<h2 id="interfacing-with-boilerpipe">Interfacing with Boilerpipe</h2>
<p>To install boilerpipe, you just have to run an ant script:</p>
<div class="highlight"><pre><span></span>$ cd boilerpipe
$ ant
</pre></div>
<p>Here is a simple example of how to use boilerpipe in Java, from their
sources:</p>
<div class="highlight"><pre><span></span>package de.l3s.boilerpipe.demo;

import java.net.URL;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class Oneliner {
    public static void main(final String[] args) throws Exception {
        final URL url = new URL("http://notmyidea.org");
        System.out.println(ArticleExtractor.INSTANCE.getText(url));
    }
}
</pre></div>
<p>To run it:</p>
<div class="highlight"><pre><span></span>$ javac -cp dist/boilerpipe-1.1-dev.jar:lib/nekohtml-1.9.13.jar:lib/xerces-2.9.1.jar src/demo/de/l3s/boilerpipe/demo/Oneliner.java
$ java -cp src/demo:dist/boilerpipe-1.1-dev.jar:lib/nekohtml-1.9.13.jar:lib/xerces-2.9.1.jar de.l3s.boilerpipe.demo.Oneliner
</pre></div>
<p>Yes, this is kind of ugly, sorry for your eyes. Let's try something
similar, but from Python:</p>
<div class="highlight"><pre><span></span>import jpype

# start the JVM with the right classpath
classpath = "dist/boilerpipe-1.1-dev.jar:lib/nekohtml-1.9.13.jar:lib/xerces-2.9.1.jar"
jpype.startJVM(jpype.getDefaultJVMPath(), "-Djava.class.path=%s" % classpath)

# get the Java classes we want to use
DefaultExtractor = jpype.JPackage("de").l3s.boilerpipe.extractors.DefaultExtractor

# call them!
print(DefaultExtractor.INSTANCE.getText(jpype.java.net.URL("http://blog.notmyidea.org")))
</pre></div>
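<p>The classpath above is joined with <code>:</code>, which is the separator on Unix-like systems; on Windows the JVM expects <code>;</code>. As a small sketch, the option string can be built with <code>os.pathsep</code> instead of a hard-coded separator. <code>jvm_classpath_option</code> is a hypothetical helper name, and the jar paths are simply the ones used in the example.</p>

```python
import os


def jvm_classpath_option(jars):
    """Build the -Djava.class.path option from a list of jar paths.

    Hypothetical helper: os.pathsep is ":" on Unix-like systems and ";"
    on Windows, matching the separator the JVM expects on each platform.
    """
    return "-Djava.class.path=%s" % os.pathsep.join(jars)


jars = [
    "dist/boilerpipe-1.1-dev.jar",
    "lib/nekohtml-1.9.13.jar",
    "lib/xerces-2.9.1.jar",
]
option = jvm_classpath_option(jars)
```

<p>The resulting <code>option</code> can then be passed as the second argument to <code>jpype.startJVM</code>, in place of the hand-formatted string.</p>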
<p>And you get what you want.</p>
<p>I must say I didn't think it would work so easily. This will let me
extract text content from URLs and remove the <em>boilerplate</em> text
easily for infuse (my master's thesis project), without having to write
Java code. Nice!</p>
</div>
</div>
</div>

<label for="sidebar-checkbox" class="sidebar-toggle"></label>

<script>
(function(document) {
  var i = 0;
  // snip empty header rows since markdown can't
  var rows = document.querySelectorAll('tr');
  for(i=0; i<rows.length; i++) {
    var ths = rows[i].querySelectorAll('th');
    var rowlen = rows[i].children.length;
    if (ths.length > 0 && ths.length === rowlen) {
      rows[i].remove();
    }
  }
})(document);
</script>

<script>
/* Lanyon & Poole are Copyright (c) 2014 Mark Otto. Adapted to Pelican 20141223 and extended a bit by @thomaswilley */
(function(document) {
  var toggle = document.querySelector('.sidebar-toggle');
  var sidebar = document.querySelector('#sidebar');
  var checkbox = document.querySelector('#sidebar-checkbox');
  document.addEventListener('click', function(e) {
    var target = e.target;
    if(!checkbox.checked ||
       sidebar.contains(target) ||
       (target === checkbox || target === toggle)) return;
    checkbox.checked = false;
  }, false);
})(document);
</script>
<!-- Piwik -->
<script type="text/javascript">
var _paq = _paq || [];
_paq.push(['trackPageView']);
_paq.push(['enableLinkTracking']);
(function() {
  var u="//tracker.notmyidea.org/";
  _paq.push(['setTrackerUrl', u+'piwik.php']);
  _paq.push(['setSiteId', 3]);
  var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
  g.type='text/javascript'; g.async=true; g.defer=true; g.src=u+'piwik.js'; s.parentNode.insertBefore(g,s);
})();
</script>
<noscript><p><img src="//tracker.notmyidea.org/piwik.php?idsite=3" style="border:0;" alt="" /></p></noscript>
<!-- End Piwik Code -->
</div>
</body>
</html>