<h1>Just enough Makefile to be dangerous</h1>
<p>Balthazar Rouberol, 2023-08-30</p>
<p>Over the years, I've developed a mix of appreciation and frustration for <code>make</code>. While it's conveniently ubiquitous across UNIX systems and widely used, its syntax often feels perplexing and unwieldy, posing debugging challenges. In this article, I share best practices I've embraced to make working with <code>make</code> a more satisfying experience.</p><div class="toc"><span class="toctitle">Table of Contents</span><ul>
<li><a href="#getting-started-with-make">Getting started with make</a><ul>
<li><a href="#the-step-structure">The step structure</a></li>
<li><a href="#phony-targets">Phony targets</a></li>
<li><a href="#default-target">Default target</a></li>
</ul>
</li>
<li><a href="#my-best-practices">My best practices</a><ul>
<li><a href="#makefile-auto-documentation-as-the-default-step">Makefile auto-documentation as the default step</a></li>
<li><a href="#tell-whats-happening-not-how">Tell what's happening, not how</a></li>
<li><a href="#define-commonalities-in-variables">Define commonalities in variables</a></li>
<li><a href="#keep-all-paths-in-the-makefile">Keep all paths in the Makefile</a></li>
<li><a href="#generate-a-visual-representation-of-the-makefile">Generate a visual representation of the Makefile</a></li>
<li><a href="#keep-things-readable">Keep things readable</a></li>
</ul>
</li>
</ul>
</div>
<p>Over the years, I have developed a bit of a love-hate relationship with <code>make</code>. On the plus side, it is ubiquitous, preinstalled on most UNIX systems, and widely used. On the other hand, its syntax can feel arcane and clunky, and it can prove hard to debug.
In this article, I will go over the basic <code>make</code> concepts, and the set of best practices I've come to embrace as my own, to make <code>make</code> enjoyable to use.</p>
<p>Let's start at the beginning.</p>
<h2 id="getting-started-with-make">Getting started with <code>make</code></h2>
<h3 id="the-step-structure">The step structure</h3>
<p><code>make</code> is a build system: a tool that lets you define the steps needed to build your project, while only rebuilding what actually needs rebuilding, to keep build times as short as possible. All these steps are defined in a file named <code>Makefile</code>, usually located at the root of your project.</p>
<p>A <code>make</code> step has the following syntax:</p>
<div class="highlight"><pre><span></span><code><span class="nf">target</span><span class="o">:</span><span class="w"> </span>[<span class="n">space</span> <span class="n">separated</span> <span class="n">dependencies</span>]
<span class="w"> </span>shell<span class="w"> </span>instructions
<span class="w"> </span>...
</code></pre></div>
<p>By default, <code>make</code> assumes that a target is a <em>file</em>, and will build it by executing the shell instructions associated with that target, after first processing the target's dependencies (if any).</p>
<p>Let's have a look at a simple example in which we will build this <code>hello.c</code> file into a <code>hello</code> binary, using the <code>gcc</code> compiler.</p>
<div class="highlight"><pre><span></span><code><span class="cp">#include</span><span class="w"> </span><span class="cpf"><stdio.h></span>
<span class="kt">int</span><span class="w"> </span><span class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">printf</span><span class="p">(</span><span class="s">"hello world</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div>
<p>We define the following <code>Makefile</code>:</p>
<div class="highlight"><pre><span></span><code><span class="nf">hello</span><span class="o">:</span><span class="w"> </span><span class="n">hello</span>.<span class="n">c</span>
<span class="w"> </span>gcc<span class="w"> </span>hello.c<span class="w"> </span>-o<span class="w"> </span>hello
</code></pre></div>
<p>We can then run <code>make hello</code> to compile the <code>hello</code> binary, after which we run it:</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>make<span class="w"> </span>hello
gcc<span class="w"> </span>hello.c<span class="w"> </span>-o<span class="w"> </span>hello
$<span class="w"> </span>./hello
hello<span class="w"> </span>world
</code></pre></div>
<p>When we ran <code>make hello</code>, <code>make</code> detected that the <code>hello</code> file wasn't found on disk, and built it by running <code>gcc hello.c -o hello</code>.</p>
<p>What happens if we re-run the same command now?</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>make<span class="w"> </span>hello
make:<span class="w"> </span><span class="sb">`</span>hello<span class="err">'</span><span class="w"> </span>is<span class="w"> </span>up<span class="w"> </span>to<span class="w"> </span>date.
</code></pre></div>
<p><code>make</code> detected that <code>hello.c</code> hadn't changed since <code>hello</code> was last built, and thus did nothing. If we change <code>hello.c</code> to print <code>hello bobbytables</code> instead of <code>hello world</code>, <code>make</code> will see that the file has changed and will happily rebuild the binary:</p>
<div class="highlight"><pre><span></span><code>#include <stdio.h>
int main() {
<span class="gd">- printf("hello world\n");</span>
<span class="gi">+ printf("hello bobbytables\n");</span>
<span class="w"> </span> return 0;
}
</code></pre></div>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>make<span class="w"> </span>hello
gcc<span class="w"> </span>hello.c<span class="w"> </span>-o<span class="w"> </span>hello
$<span class="w"> </span>./hello
hello<span class="w"> </span>bobbytables
</code></pre></div>
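<p>This rebuild decision is based purely on file modification times: a target is rebuilt whenever it is missing or older than one of its dependencies. Here is a self-contained sketch of that behavior, using <code>cp</code> as a stand-in for a real compiler so it runs anywhere:</p>

```shell
mkdir -p /tmp/make-mtime-demo && cd /tmp/make-mtime-demo
printf 'int main(void) { return 0; }\n' > hello.c
# cp stands in for gcc here: the only point is the timestamp comparison
printf 'hello: hello.c\n\tcp hello.c hello\n' > Makefile

make hello   # target missing: the recipe runs
make hello   # target newer than hello.c: "'hello' is up to date."
touch hello.c
make hello   # dependency now newer than the target: the recipe runs again
```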
<h3 id="phony-targets">Phony targets</h3>
<p>Now say you'd like to define a <code>run</code> step that simply runs the binary:</p>
<div class="highlight"><pre><span></span><code><span class="nf">hello</span><span class="o">:</span>
<span class="w"> </span>gcc<span class="w"> </span>-o<span class="w"> </span>hello<span class="w"> </span>hello.c
<span class="nf">run</span><span class="o">:</span><span class="w"> </span><span class="n">hello</span>
<span class="w"> </span>./hello
</code></pre></div>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>make<span class="w"> </span>run
./hello
hello<span class="w"> </span>world
</code></pre></div>
<p>The issue here is that <code>run</code> does not represent a file on disk: if a file named <code>run</code> ever appeared in the project, <code>make</code> would consider the target up to date and skip it. To avoid confusing <code>make</code>, we mark this step as <code>.PHONY</code>, i.e. not a file <code>make</code> needs to build. This makes sure the associated shell instructions are always executed.</p>
<div class="highlight"><pre><span></span><code><span class="nf">hello</span><span class="o">:</span>
<span class="w"> </span>gcc<span class="w"> </span>-o<span class="w"> </span>hello<span class="w"> </span>hello.c
<span class="nf">.PHONY</span><span class="o">:</span><span class="w"> </span><span class="n">run</span>
<span class="nf">run</span><span class="o">:</span><span class="w"> </span><span class="n">hello</span>
<span class="w"> </span>./hello
</code></pre></div>
<h3 id="default-target">Default target</h3>
<p>We can define what step should run when invoking <code>make</code> without any argument by using <code>.DEFAULT_GOAL</code>:</p>
<div class="highlight"><pre><span></span><code><span class="nv">.DEFAULT_GOAL</span><span class="w"> </span><span class="o">=</span><span class="w"> </span>run
<span class="nf">hello</span><span class="o">:</span>
<span class="w"> </span>gcc<span class="w"> </span>-o<span class="w"> </span>hello<span class="w"> </span>hello.c
<span class="nf">.PHONY</span><span class="o">:</span><span class="w"> </span><span class="n">run</span>
<span class="nf">run</span><span class="o">:</span><span class="w"> </span><span class="n">hello</span>
<span class="w"> </span>./hello
</code></pre></div>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>make
./hello
hello<span class="w"> </span>world
</code></pre></div>
<div class="Note">
<p>We can hide the command being executed by prefixing it with <code>@</code>.</p>
<div class="highlight"><pre><span></span><code><span class="nv">.DEFAULT_GOAL</span><span class="w"> </span><span class="o">=</span><span class="w"> </span>run
<span class="nf">hello</span><span class="o">:</span>
<span class="w"> </span>gcc<span class="w"> </span>-o<span class="w"> </span>hello<span class="w"> </span>hello.c
<span class="nf">.PHONY</span><span class="o">:</span><span class="w"> </span><span class="n">run</span>
<span class="nf">run</span><span class="o">:</span><span class="w"> </span><span class="n">hello</span>
<span class="w"> </span>@./hello
</code></pre></div>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>make
hello<span class="w"> </span>world
</code></pre></div>
</div>
<p>And with that, we now know just enough to get started for real.</p>
<h2 id="my-best-practices">My best practices</h2>
<div class="Note">
<p>The examples are taken from the <a href="https://github.com/brouberol/5esheets"><code>5esheets</code></a> <a href="https://github.com/brouberol/5esheets/blob/main/Makefile">Makefile</a>.</p>
</div>
<h3 id="makefile-auto-documentation-as-the-default-step">Makefile auto-documentation as the default step</h3>
<p>Ever since I stumbled on this <a href="https://marmelab.com/blog/2016/02/29/auto-documented-makefile.html">article</a>, I have made sure to auto-document all my <code>Makefile</code>s, to help with discoverability. This works by adding a one-line explanation of each "public" target (the ones a contributor might find themselves executing) after a <code>##</code> marker. We then define a <code>help</code> target that parses the current <code>Makefile</code>, extracts all the target names and associated comments, and formats them nicely. The finishing touch is to make <code>help</code> the default target, so that a newcomer can immediately see what can be built with your <code>Makefile</code>.</p>
<div class="highlight"><pre><span></span><code><span class="nv">.DEFAULT_GOAL</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">help</span>
<span class="err">...</span>
<span class="nf">run</span><span class="o">:</span><span class="w"> </span><span class="n">admin</span>-<span class="n">statics</span> <span class="n">build</span> <span class="c">## Run the app</span>
<span class="w"> </span>...
<span class="nf">help</span><span class="o">:</span><span class="w"> </span><span class="c">## Display help</span>
<span class="w"> </span>@grep<span class="w"> </span>-E<span class="w"> </span><span class="s1">'^[a-zA-Z_-]+:.*?## .*$$'</span><span class="w"> </span><span class="k">$(</span>MAKEFILE_LIST<span class="k">)</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>sort<span class="w"> </span><span class="p">|</span><span class="w"> </span>awk<span class="w"> </span><span class="s1">'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-30s\033[0m %s\n", $$1, $$2}'</span>
</code></pre></div>
<p>This is what the output looks like for the <code>5esheets</code> project:</p>
<p><img alt="" decoding="async" loading="lazy" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/makefiles/autodoc.webp"></p>
<h3 id="tell-whats-happening-not-how">Tell what's happening, not how</h3>
<p>I personally like to have each step include a short explanation of what it is doing, and hide the actual shell command, which I find of low value.</p>
<div class="highlight"><pre><span></span><code><span class="nf">deps-python</span><span class="o">:</span><span class="w"> </span><span class="n">poetry</span>.<span class="n">lock</span>
<span class="w"> </span>@echo<span class="w"> </span><span class="s2">"\n[+] Installing python dependencies"</span>
<span class="w"> </span>@poetry<span class="w"> </span>install
</code></pre></div>
<p>In that example, when the target executes, I see <code>[+] Installing python dependencies</code>, as well as the command output, but not the <code>poetry install</code> command itself. I find that communicating the <em>intent</em> is clearer and more self-explanatory than taking up screen real-estate with the nitty-gritty details.</p>
<h3 id="define-commonalities-in-variables">Define commonalities in variables</h3>
<p>When I find myself repeating the same thing across rules, that's when I start using variables. For example, instead of writing many rules that hardcode a given directory name, I define that directory name in a variable. This makes it easier to keep the <code>Makefile</code> valid when the project structure evolves.</p>
<div class="highlight"><pre><span></span><code><span class="nv">app-root</span><span class="w"> </span><span class="o">=</span><span class="w"> </span>dnd5esheets
<span class="nf">black</span><span class="o">:</span>
<span class="w"> </span>@echo<span class="w"> </span><span class="s2">"\n[+] Reformatting python files"</span>
<span class="w"> </span>@poetry<span class="w"> </span>run<span class="w"> </span>black<span class="w"> </span>--check<span class="w"> </span><span class="k">$(</span>app-root<span class="k">)</span>/
<span class="nf">mypy</span><span class="o">:</span>
<span class="w"> </span>@echo<span class="w"> </span><span class="s2">"\n[+] Checking Python types"</span>
<span class="w"> </span>@poetry<span class="w"> </span>run<span class="w"> </span>mypy<span class="w"> </span><span class="k">$(</span>app-root<span class="k">)</span>/
<span class="nf">ruff</span><span class="o">:</span>
<span class="w"> </span>@echo<span class="w"> </span><span class="s2">"\n[+] Running linter"</span>
<span class="w"> </span>@poetry<span class="w"> </span>run<span class="w"> </span>ruff<span class="w"> </span><span class="k">$(</span>app-root<span class="k">)</span>/
</code></pre></div>
<h3 id="keep-all-paths-in-the-makefile">Keep all paths in the Makefile</h3>
<p>Some of my targets are generated by scripts (usually Python) that process some input and dump their result to a target file. I find that passing the output file path to the script (instead of hardcoding the file path in the script) makes the <code>Makefile</code> more self-contained, and makes it easier to rename files without having to update both the <code>Makefile</code> <em>and</em> the script.</p>
<div class="highlight"><pre><span></span><code><span class="nf">$(data-dir)/translations-items-fr.json</span><span class="o">:</span>
<span class="w"> </span>@echo<span class="w"> </span><span class="s2">"\n[+] Fetching items french translations"</span>
<span class="w"> </span>@curl<span class="w"> </span>-s<span class="w"> </span><span class="k">$(</span>fr-translations-data-dir<span class="k">)</span>/dnd5e.items.json<span class="w"> </span>><span class="w"> </span><span class="k">$(</span>data-dir<span class="k">)</span>/translations-items-fr.json
<span class="nf">$(data-dir)/items-base.json</span><span class="o">:</span><span class="w"> </span><span class="k">$(</span><span class="nv">data-dir</span><span class="k">)</span>/<span class="n">translations</span>-<span class="n">items</span>-<span class="n">fr</span>.<span class="n">json</span>
<span class="w"> </span>@echo<span class="w"> </span><span class="s2">"\n[+] Fetching base equipment data"</span>
<span class="w"> </span>@curl<span class="w"> </span>-s<span class="w"> </span><span class="k">$(</span>5etools-data-dir<span class="k">)</span>/items-base.json<span class="w"> </span><span class="p">|</span><span class="w"> </span>./scripts/preprocess_base_item_json.py<span class="w"> </span><span class="k">$(</span>data-dir<span class="k">)</span>/items-base.json
</code></pre></div>
<p>We can then avoid repeating ourselves by leveraging the <code>$@</code> automatic variable, which expands to the name of the target being generated.</p>
<div class="highlight"><pre><span></span><code><span class="nf">$(data-dir)/translations-items-fr.json</span><span class="o">:</span>
<span class="w"> </span>@echo<span class="w"> </span><span class="s2">"\n[+] Fetching items french translations"</span>
<span class="w"> </span>@curl<span class="w"> </span>-s<span class="w"> </span><span class="k">$(</span>fr-translations-data-dir<span class="k">)</span>/dnd5e.items.json<span class="w"> </span>><span class="w"> </span><span class="nv">$@</span>
<span class="nf">$(data-dir)/items-base.json</span><span class="o">:</span><span class="w"> </span><span class="k">$(</span><span class="nv">data-dir</span><span class="k">)</span>/<span class="n">translations</span>-<span class="n">items</span>-<span class="n">fr</span>.<span class="n">json</span>
<span class="w"> </span>@echo<span class="w"> </span><span class="s2">"\n[+] Fetching base equipment data"</span>
<span class="w"> </span>@curl<span class="w"> </span>-s<span class="w"> </span><span class="k">$(</span>5etools-data-dir<span class="k">)</span>/items-base.json<span class="w"> </span><span class="p">|</span><span class="w"> </span>./scripts/preprocess_base_item_json.py<span class="w"> </span><span class="nv">$@</span>
</code></pre></div>
<h3 id="generate-a-visual-representation-of-the-makefile">Generate a visual representation of the Makefile</h3>
<p>I like having a visual representation of the dependencies of each target. It allows me to debug why some targets are not being rebuilt when they should be, or are always being rebuilt when they shouldn't be. I find that it also helps when getting started with the project for the first time. I leverage the <a href="https://pypi.org/project/makefile2dot/"><code>makefile2dot</code></a> Python package for this:</p>
<div class="highlight"><pre><span></span><code><span class="nf">doc/makefile.png</span><span class="o">:</span><span class="w"> </span><span class="n">Makefile</span>
<span class="w"> </span>@echo<span class="w"> </span><span class="s2">"\n[+] Generating a visual graph representation of the Makefile"</span>
<span class="w"> </span>@poetry<span class="w"> </span>run<span class="w"> </span>makefile2dot<span class="w"> </span>-o<span class="w"> </span><span class="nv">$@</span>
</code></pre></div>
<div class="Note">
<p>You'll notice that this target depends on the <code>Makefile</code> itself, as it needs to be re-generated as the <code>Makefile</code> evolves.</p>
</div>
<p><img alt="" decoding="async" loading="lazy" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/makefiles/makefile.webp"></p>
<h3 id="keep-things-readable">Keep things readable</h3>
<p>This is probably my most fundamental best practice.</p>
<p>Over the years, I have realized that I'm not smart enough to maintain a cryptic-looking <code>Makefile</code>. In my view, articles such as <a href="https://www.cs.colby.edu/maxwell/courses/tutorials/maketutor/">this one</a> steer the reader into producing "smart" Makefiles that are non-obvious to reason about (especially the last example). I need to be able to read a target's logic and understand what it does months after having written it. Likewise, I won't hesitate to repeat myself and avoid variables when I think the result reads more clearly, and I try not to overuse <a href="https://devhints.io/makefile">"magic variables"</a>.</p>
<p>There's a delicate balance to be struck between expressiveness and readability, and I think readability should <em>always</em> win. You'll thank yourself later.</p>
<h1>Pinning your SQLite version across environments</h1>
<p>Balthazar Rouberol, 2023-08-25</p>
<p>This article discusses the challenges of maintaining consistent versions of the SQLite library across the different environments of a project that relies heavily on it. Unlike traditional databases, where server versions can easily be pinned, SQLite is embedded in applications, leading to potential feature mismatches depending on what version each environment's system package manager makes available.</p><p>The <a href="https://github.com/brouberol/5esheets">project</a> I'm currently working on only has a single external dependency: <a href="https://www.sqlite.org/">SQLite</a>, with <a href="https://www.sqlite.org/fts5.html">full text search</a> enabled. As a result, the application is extremely easy to package and run. However, I found out that ensuring that you have the <em>exact same</em> SQLite <a href="https://github.com/brouberol/5esheets/pull/207#issuecomment-1672131123">version and feature</a> set in all your environments (development machines running macOS and linux, CI, and production) is trickier than I expected.</p>
<p>When you rely on a traditional database server (PostgreSQL, MySQL, MongoDB, etc.), you can achieve this by running the same server version in all your environments.</p>
<div class="Note">
<p>Docker really shines here, as it allows you to do just that in a single command.</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>docker<span class="w"> </span>run<span class="w"> </span>postgres:15.4
</code></pre></div>
</div>
<p>Things are a bit different with SQLite, as it is <em>not</em> an SQL server. It is a <em>library</em> that you embed in your program (either by compiling it alongside your code, or by relying on a shared library and language bindings). Python does the latter: its <code>sqlite3</code> module is written in C using the CPython API, and <a href="https://github.com/python/cpython/blob/4ae3edf3008b70e20663143553a736d80ff3a501/Modules/_sqlite/connection.h#L32">includes</a> the <code>sqlite3.h</code> header file. Where does this header file come from though?</p>
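<p>Incidentally, the <code>sqlite3</code> module can directly report the version of the library it ended up bound to, which gives us a quick way to compare environments:</p>

```python
import sqlite3

# version of the SQLite C library the sqlite3 module is linked against
print(sqlite3.sqlite_version)        # e.g. "3.40.1"

# same information as a tuple, convenient for comparisons
print(sqlite3.sqlite_version_info >= (3, 9, 0))  # FTS5 first shipped in 3.9.0
```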
<h3 id="inspecting-the-sqlite-version-on-linux">Inspecting the sqlite version on linux</h3>
<p>If we have a look at a <code>python3.11</code> installation directory on a random Ubuntu server, we see that it bundles an <code>_sqlite3.so</code> shared object, which itself dynamically loads <code>libsqlite3.so.0</code>.</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span><span class="nb">find</span><span class="w"> </span>/usr/lib/python3.11<span class="w"> </span>-name<span class="w"> </span><span class="s2">"*sqlite3*.so"</span>
/usr/lib/python3.11/lib-dynload/_sqlite3.cpython-311-x86_64-linux-gnu.so
$<span class="w"> </span><span class="nb">ldd</span><span class="w"> </span>/usr/lib/python3.11/lib-dynload/_sqlite3.cpython-311-x86_64-linux-gnu.so
<span class="w"> </span>linux-vdso.so.1<span class="w"> </span><span class="o">(</span>0x00007ffcda976000<span class="o">)</span>
<span class="w"> </span>libsqlite3.so.0<span class="w"> </span><span class="o">=</span><span class="k">></span><span class="w"> </span>/lib/x86_64-linux-gnu/libsqlite3.so.0<span class="w"> </span><span class="o">(</span>0x00007fab44d9c000<span class="o">)</span><span class="w"> </span><span class="c1"># <--</span>
<span class="w"> </span>libc.so.6<span class="w"> </span><span class="o">=</span><span class="k">></span><span class="w"> </span>/lib/x86_64-linux-gnu/libc.so.6<span class="w"> </span><span class="o">(</span>0x00007fab44a00000<span class="o">)</span>
<span class="w"> </span>libm.so.6<span class="w"> </span><span class="o">=</span><span class="k">></span><span class="w"> </span>/lib/x86_64-linux-gnu/libm.so.6<span class="w"> </span><span class="o">(</span>0x00007fab44cb3000<span class="o">)</span>
<span class="w"> </span>/lib64/ld-linux-x86-64.so.2<span class="w"> </span><span class="o">(</span>0x00007fab44f17000<span class="o">)</span>
</code></pre></div>
<p>Same question: where does <code>/lib/x86_64-linux-gnu/libsqlite3.so.0</code> come from then?</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span><span class="nb">apt-file</span><span class="w"> </span>search<span class="w"> </span>/lib/x86_64-linux-gnu/libsqlite3.so.0
libsqlite3-0:<span class="w"> </span>/usr/lib/x86_64-linux-gnu/libsqlite3.so.0
libsqlite3-0:<span class="w"> </span>/usr/lib/x86_64-linux-gnu/libsqlite3.so.0.8.6
$<span class="w"> </span><span class="nb">apt-cache</span><span class="w"> </span>search<span class="w"> </span>libsqlite3-0
libsqlite3-0<span class="w"> </span>-<span class="w"> </span>SQLite<span class="w"> </span><span class="m">3</span><span class="w"> </span>shared<span class="w"> </span>library
</code></pre></div>
<p>This means that python relies on whatever <code>libsqlite3</code> version is installed by the <em>system package manager</em>. We can double check this by having a look at the <code>python3</code> package recursive dependencies: <a href="https://packages.ubuntu.com/lunar/python3"><code>python3</code></a> -> <a href="https://packages.ubuntu.com/lunar/libpython3-stdlib"><code>libpython3-stdlib</code></a> -> <a href="https://packages.ubuntu.com/lunar/libpython3.11-stdlib"><code>libpython3.11-stdlib</code></a> -> <a href="https://packages.ubuntu.com/lunar/libsqlite3-0"><code>libsqlite3-0</code></a>.</p>
<p>To know what version is installed on that system, we can inspect the version of the <code>libsqlite3-0</code> apt package:</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span><span class="nb">apt-cache</span><span class="w"> </span>show<span class="w"> </span>libsqlite3-0<span class="w"> </span><span class="k">|</span><span class="w"> </span><span class="nb">grep</span><span class="w"> </span>Version
Version:<span class="w"> </span><span class="m">3</span>.40.1-1
</code></pre></div>
<p>We can check that we're getting this exact version via python:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">sqlite3</span>
<span class="o">>>></span> <span class="n">conn</span> <span class="o">=</span> <span class="n">sqlite3</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span><span class="s2">":memory:"</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">conn</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">"select sqlite_version()"</span><span class="p">)</span><span class="o">.</span><span class="n">fetchone</span><span class="p">()</span>
<span class="p">(</span><span class="s1">'3.40.1'</span><span class="p">,)</span>
</code></pre></div>
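<p>Since the whole point of pinning here is the full text search feature set, checking the version number alone is not quite enough. Here is a quick sketch to check that FTS5 was actually compiled in (the probe table name is arbitrary):</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
try:
    # FTS5 is a compile-time option: creating a virtual table is the most
    # direct way to confirm it is present in the linked library
    conn.execute("CREATE VIRTUAL TABLE fts_probe USING fts5(content)")
    print("FTS5 available")
except sqlite3.OperationalError:
    print("FTS5 not available")
```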
<h3 id="inspecting-the-sqlite-version-on-macos">Inspecting the sqlite version on macOS</h3>
<p>Assuming you are installing your packages via <code>brew</code> on macOS, you'll find that it does things a bit differently than <code>apt</code>. The <code>python3</code> formula <a href="https://github.com/Homebrew/homebrew-core/blob/1aa36b1d93b4ee968d8d355640735f5ec21e7262/Formula/p/python@3.11.rb#L30">depends on <code>sqlite</code></a>, which itself <a href="https://github.com/Homebrew/homebrew-core/blob/1aa36b1d93b4ee968d8d355640735f5ec21e7262/Formula/s/sqlite.rb#L4">downloads</a> an archive pinned to a given version (<code>3.43.0</code> at the time of writing), and then <a href="https://github.com/Homebrew/homebrew-core/blob/1aa36b1d93b4ee968d8d355640735f5ec21e7262/Formula/s/sqlite.rb#L36-L56">compiles <code>libsqlite3.dylib</code></a>.</p>
<p>Indeed, we see this library when inspecting the content of the <code>sqlite</code> brew package:</p>
<div class="highlight"><pre><span></span><code><span class="k">~</span><span class="w"> </span>❯<span class="w"> </span><span class="nb">ls</span><span class="w"> </span>-alh<span class="w"> </span>/opt/homebrew/opt/sqlite/lib/libsqlite3.dylib
lrwxr-xr-x<span class="w"> </span><span class="m">18</span><span class="w"> </span>br<span class="w"> </span><span class="m">16</span><span class="w"> </span>May<span class="w"> </span><span class="m">15</span>:45<span class="w"> </span>/opt/homebrew/opt/sqlite/lib/libsqlite3.dylib<span class="w"> </span>-><span class="w"> </span>libsqlite3.0.dylib
</code></pre></div>
<p>And sure enough, we see that we're running the expected version in python:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">sqlite3</span>
<span class="o">>>></span> <span class="n">conn</span> <span class="o">=</span> <span class="n">sqlite3</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span><span class="s2">":memory:"</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">conn</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">"select sqlite_version()"</span><span class="p">)</span><span class="o">.</span><span class="n">fetchone</span><span class="p">()</span>
<span class="p">(</span><span class="s1">'3.43.0'</span><span class="p">,)</span>
</code></pre></div>
<h3 id="pinning-the-sqlite-version-by-vendoring-the-compiled-library">Pinning the sqlite version by vendoring the compiled library</h3>
<p>To pin the <code>sqlite</code> version across all environments and OSes, we can compile these shared/dynamically loaded libraries ourselves for all architectures we plan to support, vendor them in our codebase, and inject them into our application via <code>LD_PRELOAD</code>.</p>
<p>We'd need to cover all the ways we run the app:</p>
<ul>
<li>running <code>make run</code>, which runs the app on the host, against the version of <code>libsqlite3</code> installed by the package manager</li>
<li>running <code>make docker-run</code>, which runs the application in a docker container against the <code>libsqlite3</code> version available through the image OS package manager</li>
<li>running <code>make test</code> in CI (Github Actions), which runs the test against the <code>libsqlite3</code> version made available by the runner OS package manager</li>
</ul>
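<p>On linux, the injection itself can be sketched in the <code>run</code> target with <code>LD_PRELOAD</code>. Note that this is an illustrative sketch: the <code>lib/</code> directory, module name and <code>poetry run</code> invocation are assumptions, not the project's actual Makefile.</p>

```make
# Illustrative sketch, not the project's actual Makefile: tell the dynamic
# linker to resolve libsqlite3 symbols from the vendored build first
run:
	LD_PRELOAD=$(PWD)/lib/libsqlite3.so poetry run python3 -m dnd5esheets
```

<p>On macOS, the closest equivalent is <code>DYLD_INSERT_LIBRARIES</code>, with the caveat that System Integrity Protection strips <code>DYLD_*</code> variables for system binaries.</p>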
<p>Compiling the SQLite source code into a shared library was easy, as <a href="https://simonwillison.net/">Simon Willison</a> had already <a href="https://til.simonwillison.net/sqlite/sqlite-version-macos-python">documented</a> the process.</p>
<h4 id="compiling-libsqlite3-for-linux">Compiling <code>libsqlite3</code> for linux</h4>
<p>The following script compiles <code>libsqlite3</code> for linux, with full text search enabled:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># script/compile-libsqlite-linux.sh</span>
<span class="c1">#!/usr/bin/env bash</span>
<span class="nb">set</span><span class="w"> </span>-e
apt-get<span class="w"> </span>install<span class="w"> </span>-y<span class="w"> </span>build-essential<span class="w"> </span>wget<span class="w"> </span>tcl
<span class="c1"># link associated with sqlite 3.42.0, found on https://www.sqlite.org/src/timeline?t=version-3.42.0</span>
<span class="c1"># pointing to https://www.sqlite.org/src/info/831d0fb2836b71c9</span>
<span class="nv">sqlite_ref</span><span class="o">=</span>831d0fb2
wget<span class="w"> </span>https://www.sqlite.org/src/tarball/<span class="si">${</span><span class="nv">sqlite_ref</span><span class="si">}</span>/SQLite-<span class="si">${</span><span class="nv">sqlite_ref</span><span class="si">}</span>.tar.gz
tar<span class="w"> </span>-xzvf<span class="w"> </span>SQLite-<span class="si">${</span><span class="nv">sqlite_ref</span><span class="si">}</span>.tar.gz
<span class="nb">pushd</span><span class="w"> </span>SQLite-<span class="si">${</span><span class="nv">sqlite_ref</span><span class="si">}</span>
<span class="nv">CPPFLAGS</span><span class="o">=</span><span class="s2">"-DSQLITE_ENABLE_FTS5"</span><span class="w"> </span>./configure
make
<span class="nb">popd</span>
<span class="nb">mv</span><span class="w"> </span>SQLite-<span class="si">${</span><span class="nv">sqlite_ref</span><span class="si">}</span>/.libs/libsqlite3.so<span class="w"> </span>./lib/
<span class="nb">rm</span><span class="w"> </span>-r<span class="w"> </span>SQLite-<span class="si">${</span><span class="nv">sqlite_ref</span><span class="si">}</span>.tar.gz<span class="w"> </span>SQLite-<span class="si">${</span><span class="nv">sqlite_ref</span><span class="si">}</span>
</code></pre></div>
<h4 id="compiling-libsqlite3-for-macos">Compiling <code>libsqlite3</code> for macOS</h4>
<p>The following script compiles <code>libsqlite3</code> for macOS, with full text search enabled:</p>
<div class="highlight"><pre><span></span><code><span class="c1">#!/usr/bin/env bash</span>
<span class="c1"># scripts/compile-libsqlite-macos.sh</span>
<span class="nb">set</span><span class="w"> </span>-eu
<span class="nv">sqlite_version</span><span class="o">=</span><span class="m">3420000</span>
wget<span class="w"> </span>https://www.sqlite.org/2023/sqlite-amalgamation-<span class="si">${</span><span class="nv">sqlite_version</span><span class="si">}</span>.zip
unzip<span class="w"> </span>sqlite-amalgamation-<span class="si">${</span><span class="nv">sqlite_version</span><span class="si">}</span>.zip
<span class="nb">pushd</span><span class="w"> </span>sqlite-amalgamation-<span class="si">${</span><span class="nv">sqlite_version</span><span class="si">}</span>
gcc<span class="w"> </span>-dynamiclib<span class="w"> </span>sqlite3.c<span class="w"> </span>-o<span class="w"> </span>libsqlite3.0.dylib<span class="w"> </span>-lm<span class="w"> </span>-lpthread<span class="w"> </span>-DSQLITE_ENABLE_FTS5
<span class="nb">popd</span>
<span class="nb">mv</span><span class="w"> </span>sqlite-amalgamation-<span class="si">${</span><span class="nv">sqlite_version</span><span class="si">}</span>/libsqlite3.0.dylib<span class="w"> </span>./lib/
<span class="nb">rm</span><span class="w"> </span>-r<span class="w"> </span>sqlite-amalgamation-<span class="si">${</span><span class="nv">sqlite_version</span><span class="si">}</span>.zip<span class="w"> </span>sqlite-amalgamation-<span class="si">${</span><span class="nv">sqlite_version</span><span class="si">}</span>
</code></pre></div>
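<p>Before wiring either library into the app, you can sanity-check a compiled shared object directly with <code>ctypes</code>. This sketch loads whichever <code>libsqlite3</code> the dynamic linker finds; pointing <code>ctypes.CDLL</code> at <code>lib/libsqlite3.so</code> (or <code>lib/libsqlite3.0.dylib</code> on macOS) would inspect the vendored build instead:</p>

```python
import ctypes
import ctypes.util

# locate whichever libsqlite3 the dynamic linker knows about; swap in the
# path to the vendored build (e.g. "lib/libsqlite3.so") to inspect it instead
libsqlite_path = ctypes.util.find_library("sqlite3")
libsqlite = ctypes.CDLL(libsqlite_path)

# sqlite3_libversion() returns the runtime library version as a C string
libsqlite.sqlite3_libversion.restype = ctypes.c_char_p
version = libsqlite.sqlite3_libversion().decode()
print(version)
```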
<h4 id="compiling-the-right-version-on-demand">Compiling the right version on-demand</h4>
<p>We then define a <code>$(libsqlite)</code> <code>make</code> target, pointing to <code>lib/libsqlite3.so</code> if you run the app on linux, or to <code>lib/libsqlite3.0.dylib</code> if you run it on macOS. Finally, we make sure the vendored shared library overrides the system one when running the app, via <code>LD_PRELOAD</code> on linux and <code>DYLD_LIBRARY_PATH</code> on macOS.</p>
<div class="highlight"><pre><span></span><code><span class="c"># Makefile</span>
<span class="nv">UNAME_S</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="k">$(</span>shell<span class="w"> </span>uname<span class="w"> </span>-s<span class="k">)</span>
<span class="nv">PWD</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="k">$(</span>shell<span class="w"> </span><span class="nb">pwd</span><span class="k">)</span>
<span class="cp">ifeq ($(UNAME_S),Linux)</span>
<span class="w"> </span><span class="nv">libsqlite</span><span class="w"> </span><span class="o">=</span><span class="w"> </span>lib/libsqlite3.so
<span class="w"> </span><span class="nv">ld_preload</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nv">LD_PRELOAD</span><span class="o">=</span><span class="k">$(</span>PWD<span class="k">)</span>/<span class="k">$(</span>libsqlite<span class="k">)</span>
<span class="cp">else ifeq ($(UNAME_S),Darwin)</span>
<span class="w"> </span><span class="nv">libsqlite</span><span class="w"> </span><span class="o">=</span><span class="w"> </span>lib/libsqlite3.0.dylib
<span class="w"> </span><span class="nv">ld_preload</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nv">DYLD_LIBRARY_PATH</span><span class="o">=</span><span class="k">$(</span>PWD<span class="k">)</span>/lib
<span class="cp">endif</span>
<span class="nv">app-root</span><span class="w"> </span><span class="o">=</span><span class="w"> </span>dnd5esheets
<span class="nv">poetry-run</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">$(</span>ld_preload<span class="k">)</span><span class="w"> </span>poetry<span class="w"> </span>run
<span class="nf">lib/libsqlite3.so</span><span class="o">:</span>
<span class="w">	</span>@./scripts/compile-libsqlite-linux.sh
<span class="nf">lib/libsqlite3.0.dylib</span><span class="o">:</span>
<span class="w">	</span>@./scripts/compile-libsqlite-macos.sh
<span class="nf">build</span><span class="o">:</span><span class="w"> </span><span class="k">$(</span><span class="nv">libsqlite</span><span class="k">)</span> ...
<span class="nf">test</span><span class="o">:</span>
<span class="w">	</span>@<span class="k">$(</span>poetry-run<span class="k">)</span><span class="w"> </span>pytest
<span class="nf">run</span><span class="o">:</span><span class="w"> </span><span class="n">build</span> ...
<span class="w">	</span>@cd<span class="w"> </span><span class="k">$(</span>app-root<span class="k">)</span><span class="w"> </span><span class="o">&amp;&amp;</span><span class="w"> </span><span class="k">$(</span>poetry-run<span class="k">)</span><span class="w"> </span>uvicorn<span class="w"> </span>--factory<span class="w"> </span><span class="k">$(</span>app-root<span class="k">)</span>.app:create_app<span class="w"> </span>--reload
</code></pre></div>
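<p>A quick way to verify that the override actually takes effect is to ask the <code>sqlite3</code> module itself: <code>sqlite3.sqlite_version</code> reports the version of the C library loaded at interpreter startup, so it reflects <code>LD_PRELOAD</code>/<code>DYLD_LIBRARY_PATH</code>:</p>

```python
import sqlite3

# sqlite3.sqlite_version is the version string of the libsqlite3 C library
# the interpreter loaded, not the version of the Python sqlite3 module itself
loaded_version = sqlite3.sqlite_version
print(loaded_version)
```

<p>Running the same check under <code>$(ld_preload)</code> (i.e. with the environment variable the Makefile injects) should then report the vendored version, <code>3.42.0</code>.</p>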
<h4 id="compiling-libsqlite3-in-docker">Compiling <code>libsqlite3</code> in docker</h4>
<p>While the previous steps work, they also prove quite brittle, as they only work for a given CPU architecture. For example, the <code>libsqlite3.0.dylib</code> library will not load on an Intel Mac if it was compiled on an M1 or M2.</p>
<p>The most robust approach remains building <code>libsqlite3</code> in a <a href="https://docs.docker.com/build/building/multi-stage/">build stage</a> of the docker image build. This way, you <em>know</em> that you only need to build it for linux, whatever the host OS is, and you are guaranteed that it will be built for your CPU architecture, thanks to the <a href="https://hub.docker.com/layers/library/python/3.11.4-slim-bullseye/images/sha256-1226f32ad1c1c36e0b6e79706059761c58ada308f4a1ad798e55dab346e10e91?context=explore">multi-arch property</a> of the <code>python:3.11.4-slim</code> base image.</p>
<div class="highlight"><pre><span></span><code><span class="c"># Dockerfile</span>
...
<span class="c"># -- Build the libsqlite3.so shared object for the appropriate architecture</span>
<span class="k">FROM</span><span class="w"> </span><span class="s">python:3.11.4-slim</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="s">sqlite-build</span>
<span class="k">WORKDIR</span><span class="w"> </span><span class="s">/app/src/build</span>
<span class="k">COPY</span><span class="w"> </span>scripts/compile-libsqlite-linux.sh<span class="w"> </span>./
<span class="k">RUN</span><span class="w"> </span>apt-get<span class="w"> </span>update<span class="w"> </span><span class="o">&&</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>apt-get<span class="w"> </span>install<span class="w"> </span>--no-install-recommends<span class="w"> </span>-y<span class="w"> </span>build-essential<span class="w"> </span>wget<span class="w"> </span>tcl<span class="w"> </span><span class="o">&&</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>./compile-libsqlite-linux.sh<span class="w"> </span><span class="o">&&</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>apt-get<span class="w"> </span>remove<span class="w"> </span>-y<span class="w"> </span>build-essential<span class="w"> </span>wget<span class="w"> </span>tcl<span class="w"> </span><span class="o">&&</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>apt-get<span class="w"> </span>auto-clean
<span class="c"># -- Main build combining the FastAPI and compiled frontend apps</span>
<span class="k">FROM</span><span class="w"> </span><span class="s">python:3.11.4-slim</span>
...
<span class="k">COPY</span><span class="w"> </span>--from<span class="o">=</span>sqlite-build<span class="w"> </span>/app/src/build/libsqlite3.so<span class="w"> </span>./lib/libsqlite3.so
<span class="k">CMD</span><span class="w"> </span><span class="p">[</span><span class="s2">"./start-app.sh"</span><span class="p">]</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="c1">#!/bin/bash</span>
<span class="c1"># start-app.sh</span>
<span class="nb">set</span><span class="w"> </span>-e
<span class="c1"># inject the LD_PRELOAD environment variable in the process</span>
<span class="nb">exec</span><span class="w"> </span>env<span class="w"> </span><span class="nv">LD_PRELOAD</span><span class="o">=</span>./lib/libsqlite3.so<span class="w"> </span><span class="se">\</span>
<span class="w">    </span>uvicorn<span class="w"> </span>--factory<span class="w"> </span>dnd5esheets.app:create_app<span class="w"> </span>--host<span class="w"> </span><span class="s2">"0.0.0.0"</span><span class="w"> </span>--port<span class="w"> </span><span class="m">8000</span>
</code></pre></div>
<h3 id="unit-testing-the-sqlite-version-and-feature-set">Unit testing the SQLite version and feature set</h3>
<p>With all of that said and done, we can now expose the <code>sqlite</code> version and compilation options through a debug API handler:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># dnd5esheets/api/debug.py</span>
<span class="kn">from</span> <span class="nn">fastapi</span> <span class="kn">import</span> <span class="n">APIRouter</span><span class="p">,</span> <span class="n">Depends</span>
<span class="kn">from</span> <span class="nn">sqlalchemy</span> <span class="kn">import</span> <span class="n">text</span>
<span class="kn">from</span> <span class="nn">sqlalchemy.ext.asyncio</span> <span class="kn">import</span> <span class="n">AsyncSession</span>
<span class="kn">from</span> <span class="nn">dnd5esheets.db</span> <span class="kn">import</span> <span class="n">create_scoped_session</span>
<span class="n">debug_api</span> <span class="o">=</span> <span class="n">APIRouter</span><span class="p">(</span><span class="n">prefix</span><span class="o">=</span><span class="s2">"/debug"</span><span class="p">)</span>
<span class="nd">@debug_api</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"/sqlite"</span><span class="p">)</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">sqlite_info</span><span class="p">(</span>
<span class="w">    </span><span class="n">session</span><span class="p">:</span> <span class="n">AsyncSession</span> <span class="o">=</span> <span class="n">Depends</span><span class="p">(</span><span class="n">create_scoped_session</span><span class="p">),</span>
<span class="p">):</span>
<span class="w">    </span><span class="sd">"""Return debug information about the sqlite database"""</span>
<span class="w">    </span><span class="n">version</span> <span class="o">=</span> <span class="p">(</span><span class="k">await</span> <span class="n">session</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="n">text</span><span class="p">(</span><span class="s2">"select sqlite_version()"</span><span class="p">)))</span><span class="o">.</span><span class="n">scalar_one</span><span class="p">()</span>
<span class="w">    </span><span class="n">pragma_compile_options</span> <span class="o">=</span> <span class="p">(</span>
<span class="w">        </span><span class="p">(</span><span class="k">await</span> <span class="n">session</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="n">text</span><span class="p">(</span><span class="s2">"pragma compile_options"</span><span class="p">)))</span><span class="o">.</span><span class="n">scalars</span><span class="p">()</span><span class="o">.</span><span class="n">all</span><span class="p">()</span>
<span class="w">    </span><span class="p">)</span>
<span class="w">    </span><span class="k">return</span> <span class="p">{</span>
<span class="w">        </span><span class="s2">"version"</span><span class="p">:</span> <span class="n">version</span><span class="p">,</span>
<span class="w">        </span><span class="s2">"compile_options"</span><span class="p">:</span> <span class="n">pragma_compile_options</span><span class="p">,</span>
<span class="w">    </span><span class="p">}</span>
</code></pre></div>
<p>We can then query the <code>sqlite</code> version through the API:</p>
<div class="highlight"><pre><span></span><code>❯<span class="w"> </span>curl<span class="w"> </span>-s<span class="w"> </span>localhost:8000/api/debug/sqlite<span class="w"> </span><span class="k">|</span><span class="w"> </span>jq<span class="w"> </span>.version
<span class="s2">"3.42.0"</span>
</code></pre></div>
<p>However, we can go even further! By unit-testing the version and compile options, we ensure that our CI uses the exact required <code>sqlite</code> version and feature set.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># dnd5esheets/tests/test_api_debug.py</span>
<span class="k">def</span> <span class="nf">test_sqlite_version</span><span class="p">(</span><span class="n">client</span><span class="p">):</span>
<span class="w">    </span><span class="n">sqlite_debug_info</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"/api/debug/sqlite"</span><span class="p">)</span><span class="o">.</span><span class="n">json</span><span class="p">()</span>
<span class="w">    </span><span class="k">assert</span> <span class="n">sqlite_debug_info</span><span class="p">[</span><span class="s2">"version"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"3.42.0"</span>
<span class="w">    </span><span class="k">assert</span> <span class="s2">"ENABLE_FTS5"</span> <span class="ow">in</span> <span class="n">sqlite_debug_info</span><span class="p">[</span><span class="s2">"compile_options"</span><span class="p">]</span>
</code></pre></div>
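<p>The same information the debug handler exposes can also be fetched locally, against whatever <code>libsqlite3</code> the interpreter loaded, without going through the API (a standalone sanity check, not part of the project's test suite):</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
version = conn.execute("select sqlite_version()").fetchone()[0]
compile_options = [row[0] for row in conn.execute("pragma compile_options")]
conn.close()

print(version)
# ENABLE_FTS5 only shows up if the loaded library was built with -DSQLITE_ENABLE_FTS5
print("ENABLE_FTS5" in compile_options)
```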
<div class="Note">
<p>See the effect of vendoring the compiled library in CI: <a href="https://github.com/brouberol/5esheets/actions/runs/5956139650/job/16156271800#step:8:49">before</a> / <a href="https://github.com/brouberol/5esheets/actions/runs/5960929753/job/16169158344#step:8:24">after</a>.</p>
</div>
<h3 id="sources">Sources</h3>
<ul>
<li><a href="https://til.simonwillison.net/sqlite/python-sqlite-environment">https://til.simonwillison.net/sqlite/python-sqlite-environment</a></li>
<li><a href="https://til.simonwillison.net/sqlite/sqlite-version-macos-python">https://til.simonwillison.net/sqlite/sqlite-version-macos-python</a></li>
</ul>How to profile a FastAPI asynchronous request2023-08-05T00:00:00+02:002023-08-05T00:00:00+02:00Balthazar Rouberoltag:blog.balthazar-rouberol.com,2023-08-05:/how-to-profile-a-fastapi-asynchronous-request<p>In this article, I share the challenges I faced when trying to profile requests in an asynchronous FastAPI server. The traditional profiler, <code>cProfile</code>, provided inaccurate results due to the nature of asynchronous functions, which resulted in misleading statistics. To overcome this, I explored <code>pyinstrument</code>, a statistical profiler with built-in support for asynchronous Python code.</p><p>I have been experimenting with <a href="https://fastapi.tiangolo.com/">FastAPI</a> recently, a Python API framework self-describing as "high performance, easy to learn, fast to code, ready for production".</p>
<p>One of the features I wanted my <a href="https://github.com/brouberol/5esheets">project</a> to have is to be fully asynchronous, from the app server to the SQL requests. As the API is mostly I/O bound, this would allow it to handle many concurrent requests with a single server process, instead of starting a thread per request, as is commonly seen with Flask/Gunicorn.</p>
<p>However, this poses a challenge when it comes to <em>profiling</em> the code and interpreting the results.</p>
<h3 id="the-limitations-of-cprofile-when-profiling-asynchronous-code">The limitations of <code>cProfile</code> when profiling asynchronous code</h3>
<p>For example, the following graph representation was generated from a <code>cProfile</code> profile recording 300 consecutive calls to a single API endpoint, with an associated <code>get_character</code> <a href="https://github.com/brouberol/5esheets/blob/3b3bd1f99159f13e1b0e95b6ce3f825bc65a1e2d/dnd5esheets/api/character.py#L48-L63">handler</a>.</p>
<p><img alt="profile-cprofile" decoding="async" loading="lazy" src="https://user-images.githubusercontent.com/480131/258567029-c3fc4124-4822-49b2-8ce7-1cb79c501227.png"></p>
<p>Zooming in, we notice 2 things about the <code>get_character</code> span:</p>
<ul>
<li>its <code>ncalls</code> value is 9605, even though we only called it 300 times</li>
<li>it is free-floating, completely unlinked from any other span</li>
</ul>
<p><img alt="get-character-span" decoding="async" loading="lazy" src="https://github.com/brouberol/5esheets/assets/480131/71ec8ae5-553b-44bc-9613-30b5da9a6240"></p>
<p>As an asynchronous function is "entered" and "exited" by the event loop at each <code>await</code> clause, every time the event loop re-enters a function, <code>cProfile</code> sees this as an additional call, thus causing seemingly larger-than-normal <code>ncalls</code> numbers. Indeed, we <code>await</code> every time we perform an SQL request, commit or refresh the SQLAlchemy session, or do anything else inducing asynchronous I/O.
Secondly, the <code>get_character</code> span appears to be free-floating because it is executed outside of the main thread, by the Python event loop.</p>
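<p>The <code>ncalls</code> inflation is easy to reproduce with a few lines of stdlib-only code, using a toy coroutine (a hypothetical <code>handler</code>, unrelated to the project code):</p>

```python
import asyncio
import cProfile
import io
import pstats

async def handler():
    # each await suspends the coroutine; every resumption of this frame
    # by the event loop is recorded by cProfile as an additional call
    for _ in range(3):
        await asyncio.sleep(0)

profiler = cProfile.Profile()
profiler.enable()
asyncio.run(handler())
profiler.disable()

# print only the stats lines matching "handler"
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).print_stats("handler")
stats_output = stream.getvalue()
print(stats_output)
```

<p>The reported <code>ncalls</code> for <code>handler</code> ends up higher than the single call we actually made, one increment per resumption.</p>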
<p>This means that our good old faithful <code>cProfile</code> might not cut it for this inherently asynchronous server: we need a more modern profiler with built-in asynchronous support if we really want to make sense of where time is spent during a request.</p>
<h3 id="enter-pyinstrument">Enter <a href="https://pyinstrument.readthedocs.io/">pyinstrument</a>!</h3>
<p><code>pyinstrument</code> is a <em>statistical profiler</em>, unlike <code>cProfile</code>, which is <em>deterministic</em>.</p>
<blockquote>
<p>Deterministic profiling is meant to reflect the fact that all function call, function return, and exception events are monitored, and precise timings are made for the intervals between these events (during which time the user’s code is executing). In contrast, statistical profiling [...] randomly samples the effective instruction pointer, and deduces where time is being spent. The latter technique traditionally involves less overhead (as the code does not need to be instrumented), but provides only relative indications of where time is being spent.</p>
<p><em><a href="https://docs.python.org/3/library/profile.html#what-is-deterministic-profiling">Source</a></em></p>
</blockquote>
<p>It also advertises native support for profiling asynchronous Python code:</p>
<blockquote>
<p><code>pyinstrument</code> can profile async programs that use <code>async</code> and <code>await</code>. This async support works by tracking the context of execution, as provided by the built-in <a href="https://docs.python.org/3/library/contextvars.html"><code>contextvars</code></a> module.</p>
<p>When you start a <code>Profiler</code> with the <code>async_mode</code> enabled or strict (not disabled), that <code>Profiler</code> is attached to the current async context.</p>
<p>When profiling, <code>pyinstrument</code> keeps an eye on the context. When execution exits the context, it captures the await stack that caused the context to exit. Any time spent outside the context is attributed to the await that halted execution.</p>
<p><a href="https://pyinstrument.readthedocs.io/en/latest/how-it-works.html#async-profiling">Source</a></p>
</blockquote>
<p>This should allow us to get a sensible picture of where time is spent during the lifespan of a FastAPI request, while also skipping the spans that are too fast to be profiled.</p>
<h3 id="integrating-pyinstrument-with-fastapi">Integrating pyinstrument with FastAPI</h3>
<p>We rely on the <code>FastAPI.middleware</code> decorator to register a profiling middleware (only enabled if the <code>PROFILING_ENABLED</code> setting is set to <code>True</code>), in charge of profiling a request if the <code>profile=true</code> query argument is passed by the client.</p>
<p>By default, this middleware will generate a JSON report compatible with <a href="https://speedscope.app">Speedscope</a>, an online interactive flamegraph visualizer. However, if the <code>profile_format=html</code> query argument is passed, then a simple HTML report will be dumped to disk instead.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Callable</span>
<span class="kn">from</span> <span class="nn">fastapi</span> <span class="kn">import</span> <span class="n">FastAPI</span><span class="p">,</span> <span class="n">Request</span>
<span class="kn">from</span> <span class="nn">pyinstrument</span> <span class="kn">import</span> <span class="n">Profiler</span>
<span class="kn">from</span> <span class="nn">pyinstrument.renderers.html</span> <span class="kn">import</span> <span class="n">HTMLRenderer</span>
<span class="kn">from</span> <span class="nn">pyinstrument.renderers.speedscope</span> <span class="kn">import</span> <span class="n">SpeedscopeRenderer</span>
<span class="k">def</span> <span class="nf">register_middlewares</span><span class="p">(</span><span class="n">app</span><span class="p">:</span> <span class="n">FastAPI</span><span class="p">):</span>
<span class="w">    </span><span class="o">...</span>
<span class="w">    </span><span class="k">if</span> <span class="n">app</span><span class="o">.</span><span class="n">settings</span><span class="o">.</span><span class="n">PROFILING_ENABLED</span> <span class="ow">is</span> <span class="kc">True</span><span class="p">:</span>
<span class="w">        </span><span class="nd">@app</span><span class="o">.</span><span class="n">middleware</span><span class="p">(</span><span class="s2">"http"</span><span class="p">)</span>
<span class="w">        </span><span class="k">async</span> <span class="k">def</span> <span class="nf">profile_request</span><span class="p">(</span><span class="n">request</span><span class="p">:</span> <span class="n">Request</span><span class="p">,</span> <span class="n">call_next</span><span class="p">:</span> <span class="n">Callable</span><span class="p">):</span>
<span class="w">            </span><span class="sd">"""Profile the current request</span>
<span class="sd">            Taken from https://pyinstrument.readthedocs.io/en/latest/guide.html#profile-a-web-request-in-fastapi</span>
<span class="sd">            with small improvements.</span>
<span class="sd">            """</span>
<span class="w">            </span><span class="c1"># we map a profile type to a file extension, as well as a pyinstrument profile renderer</span>
<span class="w">            </span><span class="n">profile_type_to_ext</span> <span class="o">=</span> <span class="p">{</span><span class="s2">"html"</span><span class="p">:</span> <span class="s2">"html"</span><span class="p">,</span> <span class="s2">"speedscope"</span><span class="p">:</span> <span class="s2">"speedscope.json"</span><span class="p">}</span>
<span class="w">            </span><span class="n">profile_type_to_renderer</span> <span class="o">=</span> <span class="p">{</span>
<span class="w">                </span><span class="s2">"html"</span><span class="p">:</span> <span class="n">HTMLRenderer</span><span class="p">,</span>
<span class="w">                </span><span class="s2">"speedscope"</span><span class="p">:</span> <span class="n">SpeedscopeRenderer</span><span class="p">,</span>
<span class="w">            </span><span class="p">}</span>
<span class="w">            </span><span class="c1"># if the `profile=true` HTTP query argument is passed, we profile the request</span>
<span class="w">            </span><span class="k">if</span> <span class="n">request</span><span class="o">.</span><span class="n">query_params</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"profile"</span><span class="p">,</span> <span class="kc">False</span><span class="p">):</span>
<span class="w">                </span><span class="c1"># The default profile format is speedscope</span>
<span class="w">                </span><span class="n">profile_type</span> <span class="o">=</span> <span class="n">request</span><span class="o">.</span><span class="n">query_params</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"profile_format"</span><span class="p">,</span> <span class="s2">"speedscope"</span><span class="p">)</span>
<span class="w">                </span><span class="c1"># we profile the request along with all additional middlewares, by interrupting</span>
<span class="w">                </span><span class="c1"># the program every 1ms and recording the entire stack at that point</span>
<span class="w">                </span><span class="k">with</span> <span class="n">Profiler</span><span class="p">(</span><span class="n">interval</span><span class="o">=</span><span class="mf">0.001</span><span class="p">,</span> <span class="n">async_mode</span><span class="o">=</span><span class="s2">"enabled"</span><span class="p">)</span> <span class="k">as</span> <span class="n">profiler</span><span class="p">:</span>
<span class="w">                    </span><span class="n">response</span> <span class="o">=</span> <span class="k">await</span> <span class="n">call_next</span><span class="p">(</span><span class="n">request</span><span class="p">)</span>
<span class="w">                </span><span class="c1"># we dump the profiling into a file</span>
<span class="w">                </span><span class="n">extension</span> <span class="o">=</span> <span class="n">profile_type_to_ext</span><span class="p">[</span><span class="n">profile_type</span><span class="p">]</span>
<span class="w">                </span><span class="n">renderer</span> <span class="o">=</span> <span class="n">profile_type_to_renderer</span><span class="p">[</span><span class="n">profile_type</span><span class="p">]()</span>
<span class="w">                </span><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="sa">f</span><span class="s2">"profile.</span><span class="si">{</span><span class="n">extension</span><span class="si">}</span><span class="s2">"</span><span class="p">,</span> <span class="s2">"w"</span><span class="p">)</span> <span class="k">as</span> <span class="n">out</span><span class="p">:</span>
<span class="w">                    </span><span class="n">out</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">profiler</span><span class="o">.</span><span class="n">output</span><span class="p">(</span><span class="n">renderer</span><span class="o">=</span><span class="n">renderer</span><span class="p">))</span>
<span class="w">                </span><span class="k">return</span> <span class="n">response</span>
<span class="w">            </span><span class="c1"># Proceed without profiling</span>
<span class="w">            </span><span class="k">return</span> <span class="k">await</span> <span class="n">call_next</span><span class="p">(</span><span class="n">request</span><span class="p">)</span>
</code></pre></div>
<div class="Note">
<p>You can browse the project <a href="https://github.com/brouberol/5esheets/blob/main/dnd5esheets/middlewares.py">code</a> to see how the middleware is wired into the application itself</p>
</div>
<h3 id="lets-see-the-results">Let's see the results</h3>
<p><strong>HTML profile</strong>
<img alt="" decoding="async" loading="lazy" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/fastapi-async-profiling/html-pyinstrument.webp"></p>
<p><strong>Speedscope profile</strong>
<img alt="" decoding="async" loading="lazy" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/fastapi-async-profiling/speedscope.webp"></p>
<p>We see pretty clearly the different SQL requests being performed (the <code>execute</code> spans), the different <code>await</code> clauses in the code causing the event loop to pause the execution, and that most of the request time is spent in SQL requests.</p>
<p>Finally, using this setup, I was able to <a href="https://github.com/brouberol/5esheets/pull/180">observe the effects</a> of replacing the <code>json</code> stdlib library with <a href="https://github.com/ijl/orjson"><code>orjson</code></a> when deserializing JSON data from the database, and speed up this endpoint by a couple of percent very easily.</p>
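<p>Profiler aside, that kind of change can also be measured with a quick <code>timeit</code> micro-benchmark of the deserialization path (stdlib <code>json</code> shown here, with a made-up payload; swap <code>json.loads</code> for <code>orjson.loads</code> to compare):</p>

```python
import json
import timeit

# a made-up payload, roughly mimicking rows deserialized from the database
payload = json.dumps({"items": [{"id": i, "name": f"item-{i}"} for i in range(1000)]})

# time 1000 deserializations; rerun with orjson.loads to compare
duration = timeit.timeit(lambda: json.loads(payload), number=1000)
print(f"{duration:.4f}s for 1000 json.loads calls")
```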
<h3 id="sources">Sources</h3>
<ul>
<li><a href="https://pyinstrument.readthedocs.io/en/latest/how-it-works.html">https://pyinstrument.readthedocs.io/en/latest/how-it-works.html</a></li>
<li><a href="https://www.roguelynn.com/words/asyncio-profiling/">https://www.roguelynn.com/words/asyncio-profiling</a></li>
</ul>Preventing a pull request from being merged until it's safe2023-07-25T00:00:00+02:002023-07-25T00:00:00+02:00Balthazar Rouberoltag:blog.balthazar-rouberol.com,2023-07-25:/preventing-a-pull-request-from-being-merged-until-its-safe<p>Sometimes, a pull request is ready to go, but shouldn't be merged before some other changes are merged first. While the patch is valid on its own, it might depend on other changes, and could even break the application if merged <em>before</em> the other. I'll demonstrate a simple technique relying on Github Actions and pull request labels to block a pull request from being merged, until deemed safe to do so.</p><p>Sometimes, a pull request is ready to go, but shouldn't be merged before some other changes are merged first. While the patch is valid on its own, it might depend on other changes, and could even break the application if merged <em>before</em> the other.</p>
<p>I'll demonstrate a simple technique relying on Github Actions and pull request labels to fully block a pull request from being merged until deemed safe (at least without some admin privileges on the repository).</p>
<p>First, we introduce a Github Actions <a href="https://github.com/brouberol/5esheets/blob/main/.github/workflows/fail-if-do-not-merge-label.yml">workflow</a> executed when a pull request is opened, labeled or unlabeled. This workflow will fail if the pull request is labeled with <code>do not merge</code>.</p>
<div class="highlight"><pre><span></span><code><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Check do not merge</span>
<span class="nt">on</span><span class="p">:</span>
<span class="w">  </span><span class="c1"># Check label at every push in a feature branch</span>
<span class="w">  </span><span class="nt">push</span><span class="p">:</span>
<span class="w">    </span><span class="nt">branches-ignore</span><span class="p">:</span>
<span class="w">      </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">main</span>
<span class="w">  </span><span class="c1"># Check label during the lifetime of a pull request</span>
<span class="w">  </span><span class="nt">pull_request</span><span class="p">:</span>
<span class="w">    </span><span class="nt">types</span><span class="p">:</span>
<span class="w">      </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">opened</span>
<span class="w">      </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">labeled</span>
<span class="w">      </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">unlabeled</span>
<span class="nt">jobs</span><span class="p">:</span>
<span class="w">  </span><span class="nt">fail-for-do-not-merge</span><span class="p">:</span>
<span class="w">    </span><span class="nt">if</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">contains(github.event.pull_request.labels.*.name, 'do not merge')</span>
<span class="w">    </span><span class="nt">runs-on</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">ubuntu-latest</span>
<span class="w">    </span><span class="nt">steps</span><span class="p">:</span>
<span class="w">      </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Fail if PR is labeled with do not merge</span>
<span class="w">        </span><span class="nt">run</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">|</span>
<span class="w">          </span><span class="no">echo "This PR can't be merged, due to the 'do not merge' label."</span>
<span class="w">          </span><span class="no">exit 1</span>
</code></pre></div>
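<p>For intuition, here is a rough Python equivalent of the job's <code>if:</code> expression, <code>contains(github.event.pull_request.labels.*.name, 'do not merge')</code>. The payload shape mirrors GitHub's pull request event JSON; the function name is mine.</p>

```python
def has_do_not_merge_label(event: dict) -> bool:
    """Mimic contains(github.event.pull_request.labels.*.name, 'do not merge')."""
    labels = event.get("pull_request", {}).get("labels", [])
    return any(label.get("name") == "do not merge" for label in labels)

# Example payload, shaped like a pull_request event
event = {"pull_request": {"labels": [{"name": "bug"}, {"name": "do not merge"}]}}
print(has_do_not_merge_label(event))  # True
```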
<p>We then define a branch protection rule for our <code>main</code> branch by going to the repository <code>Settings</code>, then <code>Branches</code>. We add a new rule if none exists, tick <code>Require status checks to pass before merging</code>, and add the <code>fail-for-do-not-merge</code> job to the list of required checks.</p>
<p>Finally, apply the <code>do not merge</code> label to your pull request.</p>
<div class="row">
<div class="column" style="flex: 70%">
<img src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/do-not-merge/required-checks.webp
">
</div>
<div class="column" style="flex: 30%">
<img src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/do-not-merge/labels.webp
">
</div>
</div>
<p>At that point, the <code>fail-for-do-not-merge</code> check will run and fail, preventing the PR from being merged.</p>
<p><img alt="" decoding="async" loading="lazy" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/do-not-merge/merge-blocked.webp"></p>
<p>When the pull request is finally safe to merge, simply remove the <code>do not merge</code> label, and the checks will automagically pass, thus allowing you to merge.</p>
<p><img alt="" decoding="async" loading="lazy" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/do-not-merge/passing-checks.webp"></p>Generating pretty maps ready to be gift-wrapped2023-05-06T00:00:00+02:002023-05-06T00:00:00+02:00Balthazar Rouberoltag:blog.balthazar-rouberol.com,2023-05-06:/generating-pretty-maps-ready-to-be-gift-wrapped<p><img title="Lyon, France" alt="Lyon, France" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/prettymaps/lyonfrance-3000-A3-square-default.jpg" />I have been toying with the idea of generating visually pleasing maps centered on a given address, to have them printed and framed. The way I see it, it would make an original and personalised gift for the person living there. So when Marcelo de Oliveira Rosa Prates' <a href="https://github.com/marceloprates/prettymaps"><code>prettymaps</code></a> blew up on Reddit, I decided to try it.</p><p><img title="Lyon, France" alt="Lyon, France" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/prettymaps/lyonfrance-3000-A3-square-default.jpg" /></p>
<p>The library was great and the visuals looked incredible, yet I felt it was lacking a couple of features I would need in order to print the maps:</p>
<ul>
<li>a CLI to make it easy to generate maps on the fly</li>
<li>easily changing the color scheme of buildings (and allowing black and white)</li>
<li>enabling the generation of rectangular maps, on top of circle and square</li>
<li>changing the output format of the figure to make it fit into a standard page (A3, A4, etc)</li>
<li>ensuring a 300dpi output</li>
<li>setting the CLI command used to generate the map as the map title, for autodocumentation purposes</li>
</ul>
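<p>As a back-of-the-envelope check, the "standard page at 300dpi" requirement pins down the pixel dimensions of the output figure. The paper-size table and helper below are illustrative; the fork itself drives <code>matplotlib</code>'s <code>figsize</code>/<code>dpi</code> settings.</p>

```python
PAPER_SIZES_MM = {"A3": (297, 420), "A4": (210, 297)}
MM_PER_INCH = 25.4

def figure_pixels(paper: str, dpi: int = 300) -> tuple:
    """Pixel dimensions needed to fill a given paper size at a given dpi."""
    w_mm, h_mm = PAPER_SIZES_MM[paper]
    return round(w_mm / MM_PER_INCH * dpi), round(h_mm / MM_PER_INCH * dpi)

print(figure_pixels("A3"))  # (3508, 4961)
```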
<p>My good friend <a href="https://etnbrd.com/">Etienne</a> solved the <a href="https://github.com/marceloprates/prettymaps/pull/105">rectangular map generation</a> in a <em>beautifully</em> laid out PR, which has sadly been sitting there for a while without attention. It <em>seems</em> that the repository owner ran into issues with NFT con "artists", and pretty much abandoned the project, which hasn't seen any activity for the last 5 months.</p>
<p>Seeing this, I decided to <a href="https://github.com/brouberol/prettymaps">fork the project</a>, and work on the remaining ideas.</p>
<p>Here are a couple of examples of maps that I've generated and printed for people in my entourage.</p>
<div class="Note">
<p>The command used to generate each map is displayed as the map title</p>
</div>
<p><img
src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/prettymaps/248ruedespyrénées_75020_paris_france-2000-A3-square-RdPu-50.webp"
srcset="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/prettymaps/248ruedespyrénées_75020_paris_france-2000-A3-square-RdPu-30.webp 1448w, https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/prettymaps/248ruedespyrénées_75020_paris_france-2000-A3-square-RdPu-50.webp 2480w"
sizes="(max-width: 1500px) 1448w, 2480w"
alt="Paris 20e, France"
title="Paris 20e, France"
/>
<img
src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/prettymaps/40all.jeanjaurès31000toulouse_france-2000-A3-circle-default-50.webp"
srcset="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/prettymaps/40all.jeanjaurès31000toulouse_france-2000-A3-circle-default-30.webp 1488w, https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/prettymaps/40all.jeanjaurès31000toulouse_france-2000-A3-circle-default-50.webp 2480w"
sizes="(max-width: 1500px) 1488w, 2480w"
alt="Toulouse, France"
title="Toulouse, France"
/>
<img
src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/prettymaps/2impassedel'ancienneposte71100chalonsursaône_france-2000-A3-square-viridis_r-50.webp"
srcset="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/prettymaps/2impassedel'ancienneposte71100chalonsursaône_france-2000-A3-square-viridis_r-30.webp 1488w, https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/prettymaps/2impassedel'ancienneposte71100chalonsursaône_france-2000-A3-square-viridis_r-50.webp 2480w"
sizes="(max-width: 1500px) 1488w, 2480w"
alt="Chalon-sur-Saône, France"
title="Chalon-sur-Saône, France"
/>
<img
src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/prettymaps/19ruegustavebalny_60320_béthisy-saint-martin-3000-A3-square-Oranges-50.webp"
srcset="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/prettymaps/19ruegustavebalny_60320_béthisy-saint-martin-3000-A3-square-Oranges-30.webp 1488w, https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/prettymaps/19ruegustavebalny_60320_béthisy-saint-martin-3000-A3-square-Oranges-50.webp 2480w"
sizes="(max-width: 1500px) 1488w, 2480w"
alt="Béthisy Saint-Martin, France"
title="Béthisy Saint-Martin, France"
/></p>
<p>The color schemes are only applied to buildings, and are automatically generated from <a href="https://matplotlib.org/stable/gallery/color/colormap_reference.html"><code>matplotlib</code> colormaps</a>. This was a quick and easy way to generate themes "for free". I also added a couple of Scottish-tartan-inspired themes, which I used to print a map as a wedding gift for a lovely Franco-Scottish couple.</p>
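<p>To give an idea of how a discrete building palette can fall out of a colormap, the sketch below samples evenly spaced points between two RGB endpoints. The real code samples <code>matplotlib</code> colormaps directly; the endpoints here are rough stand-ins so the example stays dependency-free.</p>

```python
def lerp_palette(start: tuple, end: tuple, n: int) -> list:
    """Return n hex colors evenly interpolated between two RGB triplets."""
    palette = []
    for i in range(n):
        t = i / (n - 1) if n > 1 else 0.0
        rgb = [round(s + (e - s) * t) for s, e in zip(start, end)]
        palette.append("#{:02x}{:02x}{:02x}".format(*rgb))
    return palette

# Rough stand-ins for the light/dark ends of a colormap like 'RdPu'
print(lerp_palette((255, 247, 243), (73, 0, 106), 4))
```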
<p>My local printer bills me about 1.5€ per print, which makes for an original yet remarkably cheap gift. I recommend a thick, matte paper without any texture, as texture can clash with the map's dotted background.</p>
<p>If you'd like to give it a try, feel free to have a look at the <a href="https://github.com/brouberol/prettymaps">repository</a>!</p>Monitoring my solar panel power production2023-05-03T00:00:00+02:002023-05-03T00:00:00+02:00Balthazar Rouberoltag:blog.balthazar-rouberol.com,2023-05-03:/monitoring-my-solar-panel-power-production<p>I have recently acquired two solar panels from <a href="https://sunology.eu/products/sunology-play-kit-solaire">Sunology</a> advertising a cumulated instantaneous production of up to 810W. The panels come with a smart plug emitting the data to <a href="https://iot.tuya.com/">Tuya</a>, in order to retain and graph historical data. However, the only available granularity for that data is <em>daily</em> kWh production. In order to optimize the orientation and placement of the panels, as well as measure the production efficiency (power produced / 810 * 100), I needed a much finer granularity than that. I decided to query the data myself and send it to Datadog.</p>
<p><img alt="information flow from plug to Datadog" decoding="async" loading="lazy" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/solar-panel/schema.webp"></p>
<p>The first thing I needed to do was to find a working client that would be able to talk to the plug. It seemed that <a href="https://github.com/jasonacox/tinytuya"><code>tinytuya</code></a> would do the job. However, it didn't seem like I could simply fetch the data from the plug locally. Instead, I first needed to create a Tuya account and a cloud project, then add the plug to the project's devices to get both an API key and a device key for the plug. That proved to be quite tedious, as the Tuya IoT interface is very confusing and slow, but I managed thanks to these <a href="https://www.home-assistant.io/integrations/tuya/">Home-Assistant instructions</a>.</p>
<p><img alt="" decoding="async" loading="lazy" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/solar-panel/tuya-project.webp"></p>
<hr>
<p><img alt="" decoding="async" loading="lazy" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/solar-panel/tuya-device.webp"></p>
<p>With that data now available, I was able to set up the <code>tinytuya</code> client on a Raspberry Pi with network access to the plug's IP.</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>python<span class="w"> </span>-m<span class="w"> </span>tinytuya<span class="w"> </span>wizard
TinyTuya<span class="w"> </span>Setup<span class="w"> </span>Wizard<span class="w"> </span><span class="o">[</span><span class="m">1</span>.12.4<span class="o">]</span>
<span class="w"> </span>Enter<span class="w"> </span>API<span class="w"> </span>Key<span class="w"> </span>from<span class="w"> </span>tuya.com:<span class="w"> </span><span class="o">[</span>REDACTED<span class="o">]</span>
<span class="w"> </span>Enter<span class="w"> </span>API<span class="w"> </span>Secret<span class="w"> </span>from<span class="w"> </span>tuya.com:<span class="w"> </span><span class="o">[</span>REDACTED<span class="o">]</span>
<span class="w"> </span>Enter<span class="w"> </span>any<span class="w"> </span>Device<span class="w"> </span>ID<span class="w"> </span>currently<span class="w"> </span>registered<span class="w"> </span><span class="k">in</span><span class="w"> </span>Tuya<span class="w"> </span>App<span class="w"> </span><span class="o">(</span>used<span class="w"> </span>to<span class="w"> </span>pull<span class="w"> </span>full<span class="w"> </span>list<span class="o">)</span><span class="w"> </span>or<span class="w"> </span><span class="s1">'scan'</span><span class="w"> </span>to<span class="w"> </span>scan<span class="w"> </span><span class="k">for</span><span class="w"> </span>one:<span class="w"> </span><span class="o">[</span>REDACTED<span class="o">]</span>
<span class="w"> </span>Enter<span class="w"> </span>Your<span class="w"> </span>Region<span class="w"> </span><span class="o">(</span>Options:<span class="w"> </span>cn,<span class="w"> </span>us,<span class="w"> </span>us-e,<span class="w"> </span>eu,<span class="w"> </span>eu-w,<span class="w"> </span>or<span class="w"> </span><span class="k">in</span><span class="o">)</span>:<span class="w"> </span>eu
>><span class="w"> </span>Configuration<span class="w"> </span>Data<span class="w"> </span>Saved<span class="w"> </span>to<span class="w"> </span>tinytuya.json
>><span class="w"> </span>Device<span class="w"> </span>Listing
>><span class="w"> </span>Saving<span class="w"> </span>list<span class="w"> </span>to<span class="w"> </span>devices.json
<span class="w"> </span><span class="m">1</span><span class="w"> </span>registered<span class="w"> </span>devices<span class="w"> </span>saved
>><span class="w"> </span>Saving<span class="w"> </span>raw<span class="w"> </span>TuyaPlatform<span class="w"> </span>response<span class="w"> </span>to<span class="w"> </span>tuya-raw.json
Poll<span class="w"> </span><span class="nb">local</span><span class="w"> </span>devices?<span class="w"> </span><span class="o">(</span>Y/n<span class="o">)</span>:<span class="w"> </span>y
Scanning<span class="w"> </span><span class="nb">local</span><span class="w"> </span>network<span class="w"> </span><span class="k">for</span><span class="w"> </span>Tuya<span class="w"> </span>devices...
<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="nb">local</span><span class="w"> </span>devices<span class="w"> </span>discovered
Polling<span class="w"> </span><span class="nb">local</span><span class="w"> </span>devices...
<span class="w"> </span><span class="o">[</span>Sunology<span class="w"> </span><span class="o">]</span><span class="w"> </span><span class="m">192</span>.168.5.171<span class="w"> </span>-<span class="w"> </span><span class="o">[</span>On<span class="o">]</span><span class="w"> </span>-<span class="w"> </span>DPS:<span class="w"> </span><span class="o">{</span><span class="s1">'1'</span>:<span class="w"> </span>True,<span class="w"> </span><span class="s1">'9'</span>:<span class="w"> </span><span class="m">0</span>,<span class="w"> </span><span class="s1">'17'</span>:<span class="w"> </span><span class="m">109</span>,<span class="w"> </span><span class="s1">'18'</span>:<span class="w"> </span><span class="m">2704</span>,<span class="w"> </span><span class="s1">'19'</span>:<span class="w"> </span><span class="m">6491</span>,<span class="w"> </span><span class="s1">'20'</span>:<span class="w"> </span><span class="m">2379</span>,<span class="w"> </span><span class="s1">'21'</span>:<span class="w"> </span><span class="m">1</span>,<span class="w"> </span><span class="s1">'22'</span>:<span class="w"> </span><span class="m">529</span>,<span class="w"> </span><span class="s1">'23'</span>:<span class="w"> </span><span class="m">26153</span>,<span class="w"> </span><span class="s1">'24'</span>:<span class="w"> </span><span class="m">13705</span>,<span class="w"> </span><span class="s1">'25'</span>:<span class="w"> </span><span class="m">3040</span>,<span class="w"> </span><span class="s1">'26'</span>:<span class="w"> </span><span class="m">0</span><span class="o">}</span>
>><span class="w"> </span>Saving<span class="w"> </span>device<span class="w"> </span>snapshot<span class="w"> </span>data<span class="w"> </span>to<span class="w"> </span>snapshot.json
>><span class="w"> </span>Saving<span class="w"> </span>IP<span class="w"> </span>addresses<span class="w"> </span>to<span class="w"> </span>devices.json
<span class="w"> </span><span class="m">1</span><span class="w"> </span>device<span class="w"> </span>IP<span class="w"> </span>addresses<span class="w"> </span>found
Done.
</code></pre></div>
<p>At that point, the <code>tinytuya</code> wizard script had scanned the networks the Pi was connected to, found the plug, and was able to connect to it via the provided device key.</p>
<p>I then created a dedicated APP/API keypair on Datadog, and scheduled the following Python script to run every minute via cron.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># Run every minute via this crontab</span>
<span class="c1"># * * * * * cd /home/br/tuya && /home/br/tuya/.env/bin/python exporter.py</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="kn">import</span> <span class="nn">datadog</span>
<span class="kn">import</span> <span class="nn">tinytuya</span>
<span class="n">datadog</span><span class="o">.</span><span class="n">initialize</span><span class="p">(</span>
<span class="n">api_key</span><span class="o">=</span><span class="s2">"[REDACTED]"</span><span class="p">,</span>
<span class="n">app_key</span><span class="o">=</span><span class="s2">"[REDACTED]"</span><span class="p">,</span>
<span class="p">)</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">"devices.json"</span><span class="p">)</span> <span class="k">as</span> <span class="n">device_file</span><span class="p">:</span>
<span class="n">device_data</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">device_file</span><span class="p">)</span>
<span class="n">plug</span> <span class="o">=</span> <span class="n">tinytuya</span><span class="o">.</span><span class="n">OutletDevice</span><span class="p">(</span>
<span class="n">dev_id</span><span class="o">=</span><span class="n">device_data</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s2">"id"</span><span class="p">],</span>
<span class="n">address</span><span class="o">=</span><span class="n">device_data</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s2">"ip"</span><span class="p">],</span>
<span class="n">local_key</span><span class="o">=</span><span class="n">device_data</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s2">"key"</span><span class="p">],</span>
<span class="n">version</span><span class="o">=</span><span class="mf">3.3</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">plug_status</span> <span class="o">=</span> <span class="n">plug</span><span class="o">.</span><span class="n">updatedps</span><span class="p">()</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">plug_status</span><span class="p">[</span><span class="s2">"dps"</span><span class="p">]</span>
<span class="n">now</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span>
<span class="n">metrics</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">if</span> <span class="s2">"18"</span> <span class="ow">in</span> <span class="n">data</span><span class="p">:</span>
<span class="n">current</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s2">"18"</span><span class="p">]</span> <span class="c1"># mA</span>
<span class="n">metrics</span><span class="o">.</span><span class="n">append</span><span class="p">(</span>
<span class="p">{</span>
<span class="s2">"metric"</span><span class="p">:</span> <span class="s2">"solarpanel.current"</span><span class="p">,</span>
<span class="s2">"type"</span><span class="p">:</span> <span class="s2">"gauge"</span><span class="p">,</span>
<span class="s2">"points"</span><span class="p">:</span> <span class="p">[(</span><span class="n">now</span><span class="p">,</span> <span class="n">current</span><span class="p">)],</span>
<span class="s2">"tags"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"location:terrasse_1"</span><span class="p">],</span>
<span class="p">}</span>
<span class="p">)</span>
<span class="k">if</span> <span class="s2">"19"</span> <span class="ow">in</span> <span class="n">data</span><span class="p">:</span>
<span class="n">power</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s2">"19"</span><span class="p">]</span> <span class="o">/</span> <span class="mf">10.0</span> <span class="c1"># W</span>
<span class="n">metrics</span><span class="o">.</span><span class="n">append</span><span class="p">(</span>
<span class="p">{</span>
<span class="s2">"metric"</span><span class="p">:</span> <span class="s2">"solarpanel.power"</span><span class="p">,</span>
<span class="s2">"type"</span><span class="p">:</span> <span class="s2">"gauge"</span><span class="p">,</span>
<span class="s2">"points"</span><span class="p">:</span> <span class="p">[(</span><span class="n">now</span><span class="p">,</span> <span class="n">power</span><span class="p">)],</span>
<span class="s2">"tags"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"location:terrasse_1"</span><span class="p">],</span>
<span class="p">}</span>
<span class="p">)</span>
<span class="k">if</span> <span class="s2">"20"</span> <span class="ow">in</span> <span class="n">data</span><span class="p">:</span>
<span class="n">voltage</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s2">"20"</span><span class="p">]</span> <span class="o">/</span> <span class="mf">10.0</span> <span class="c1"># V</span>
<span class="n">metrics</span><span class="o">.</span><span class="n">append</span><span class="p">(</span>
<span class="p">{</span>
<span class="s2">"metric"</span><span class="p">:</span> <span class="s2">"solarpanel.voltage"</span><span class="p">,</span>
<span class="s2">"type"</span><span class="p">:</span> <span class="s2">"gauge"</span><span class="p">,</span>
<span class="s2">"points"</span><span class="p">:</span> <span class="p">[(</span><span class="n">now</span><span class="p">,</span> <span class="n">voltage</span><span class="p">)],</span>
<span class="s2">"tags"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"location:terrasse_1"</span><span class="p">],</span>
<span class="p">}</span>
<span class="p">)</span>
<span class="n">datadog</span><span class="o">.</span><span class="n">api</span><span class="o">.</span><span class="n">Metric</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="n">metrics</span><span class="o">=</span><span class="n">metrics</span><span class="p">)</span>
</code></pre></div>
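<p>As an aside, the three near-identical <code>if</code> blocks could be collapsed into a table-driven loop. This is a hypothetical refactor (metric names and scale factors taken from the script above), not what actually runs on the Pi.</p>

```python
import time

# DPS index -> (Datadog metric name, divisor to convert to the target unit)
DPS_METRICS = {
    "18": ("solarpanel.current", 1.0),   # mA
    "19": ("solarpanel.power", 10.0),    # tenths of W -> W
    "20": ("solarpanel.voltage", 10.0),  # tenths of V -> V
}

def build_metrics(data: dict, now: float, tags=("location:terrasse_1",)) -> list:
    """Build one Datadog gauge payload per DPS index present in the data."""
    return [
        {
            "metric": name,
            "type": "gauge",
            "points": [(now, data[dps] / scale)],
            "tags": list(tags),
        }
        for dps, (name, scale) in DPS_METRICS.items()
        if dps in data
    ]

print(build_metrics({"19": 6491}, now=time.time()))
```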
<p>At that point, the measured current, voltage and power were sent out to Datadog every minute, and I was able to create the following <a href="https://p.datadoghq.com/sb/bc352bb82-f277a5982d97a0a007ab56fbc05e0ee8">dashboard</a>:</p>
<p><a href="https://p.datadoghq.com/sb/bc352bb82-f277a5982d97a0a007ab56fbc05e0ee8"><img alt="Dashboard detailing electricity production over time" decoding="async" loading="lazy" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/solar-panel/dd-dash.webp"></a></p>
<div class="Note">
<p>This dashboard makes it seem like the panels can only hit 75% efficiency, when I have seen them hit 95-99%. This is due to Datadog's point interpolation over large time windows. When we focus on a smaller window, we can see these high (albeit brief) peaks. <img alt="" decoding="async" loading="lazy" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/solar-panel/dd-dash-2.webp"></p>
</div>
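<p>For reference, the efficiency figure shown in the dashboard boils down to dividing the instantaneous power by the advertised 810W maximum, as defined earlier in the article. A minimal sketch (the helper name is mine):</p>

```python
ADVERTISED_MAX_W = 810

def efficiency_pct(power_w: float) -> float:
    """Production efficiency: measured power over the advertised maximum."""
    return power_w / ADVERTISED_MAX_W * 100

print(round(efficiency_pct(649.1), 1))  # 80.1
```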
<p>With that granularity, I realized that the panels only really started to kick in after midday, and that I should probably move them to a spot with more exposure if I wanted to produce more than 4kWh a day (measured on a hot and sunny day without any clouds). That day, I only peaked at 85% efficiency, even though I had hit 99% at some point during the previous weeks. That makes me wonder whether I need to wash the panels.</p>
<p><strong>Edit</strong>: it rained that very night and I did hit 95% efficiency the next day.</p>Speeding up a 21h job to 8 minutes: a story of SQLAlchemy optimization2023-01-08T00:00:00+01:002023-01-08T00:00:00+01:00Balthazar Rouberoltag:blog.balthazar-rouberol.com,2023-01-08:/speeding-up-a-21h-job-to-8-minutes-a-story-of-sqlalchemy-optimization<p>In this article published on the <a href="https://medium.com/alan/blog-post-optimizing-our-longest-nightly-job-a-story-of-sessions-complexity-and-toilets-750ef4dfaa51">Alan tech blog</a>, we explain how my team has reduced the runtime of our longest nightly job from 21h to about 8 minutes, by using simple profiling and SQLAlchemy optimizations.</p><p>I have recently published an article on the <a href="https://medium.com/alan/blog-post-optimizing-our-longest-nightly-job-a-story-of-sessions-complexity-and-toilets-750ef4dfaa51">Alan tech blog</a> walking the reader through how we have reduced the runtime of our longest nightly job from 21 hours to about 8 minutes, by using simple profiling and SQLAlchemy optimizations.</p>
<p>Enjoy the read!</p>Measuring the coverage of a rust program in Github Actions2022-04-26T00:00:00+02:002022-04-26T00:00:00+02:00Balthazar Rouberoltag:blog.balthazar-rouberol.com,2022-04-26:/measuring-the-coverage-of-a-rust-program-in-github-actions<p>In this article, I will go through how I set up code coverage measurement for <code>bo</code>, my text editor written in Rust, and publicly hosted the coverage report on S3.</p><p>After having faced a couple of regressions in <a href="https://github.com/brouberol/bo"><code>bo</code></a> (my personal text editor <a href="/metaprocrastinating-on-writing-a-book-by-writing-a-text-editor">written in Rust</a>) in the past couple of days, I have tried to increase the number of unit tests related to the codebase sections handling navigation. I already had some unit tests, but I needed to know which lines of code were <em>not</em> tested, to figure out which areas of the codebase to focus on.</p>
<p>To do this, I used Mozilla's excellent <a href="https://github.com/mozilla/grcov"><code>grcov</code></a> project. I followed their instructions and ran the following commands locally, in my work directory.</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span><span class="nb">export</span><span class="w"> </span><span class="nv">RUSTFLAGS</span><span class="o">=</span><span class="s2">"-Cinstrument-coverage"</span>
<span class="gp">$ </span>cargo<span class="w"> </span>build
<span class="gp">$ </span><span class="nb">export</span><span class="w"> </span><span class="nv">LLVM_PROFILE_FILE</span><span class="o">=</span><span class="s2">"bo-%p-%m.profraw"</span>
<span class="gp">$ </span>cargo<span class="w"> </span><span class="nb">test</span>
<span class="gp">$ </span>grcov<span class="w"> </span>.<span class="w"> </span>-s<span class="w"> </span>.<span class="w"> </span>--binary-path<span class="w"> </span>./target/debug/<span class="w"> </span>-t<span class="w"> </span>html<span class="w"> </span>--branch<span class="w"> </span>--ignore-not-existing<span class="w"> </span>-o<span class="w"> </span>./target/debug/coverage/
<span class="gp">$ </span>open<span class="w"> </span>./target/debug/coverage/index.html
</code></pre></div>
<p>This way, I got a beautiful HTML report in which I could see my code coverage, either globally, file by file,</p>
<p><img alt="Coverage report" decoding="async" loading="lazy" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/rust-coverage/cov.webp"></p>
<p>or line by line.</p>
<p><img alt="Coverage report" decoding="async" loading="lazy" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/rust-coverage/cov2.webp"></p>
<p><code>grcov</code> even generates nice SVG badges displaying the coverage score, which I could embed on the project homepage!</p>
<p>What I ultimately wanted, though, was for every commit touching my <code>main</code> branch to trigger a new coverage report generation, which I could host somewhere public and read at leisure.</p>
<p>To do so, I set up a publicly accessible S3 bucket configured to host a static website, which turns out to be remarkably easy to do <a href="https://github.com/brouberol/infrastructure/commit/75192443319f36cfbdfbcee0086322c958e3cc82#diff-abe63f10056054dcb55782e4be3ccb2ec28b47e6192b3ee1b45e46ff1884738aR62-R74">in Terraform</a>:</p>
<div class="highlight"><pre><span></span><code><span class="kr">resource</span><span class="w"> </span><span class="nc">"aws_s3_bucket"</span><span class="w"> </span><span class="nv">"github-brouberol-coverage"</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="na">bucket</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"my-bucket-name"</span>
<span class="w"> </span><span class="na">provider</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nv">aws.euwest</span>
<span class="w"> </span><span class="na">acl</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"public-read"</span>
<span class="w"> </span><span class="na">force_destroy</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="no">false</span>
<span class="w"> </span><span class="nb">versioning</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="na">enabled</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="no">false</span>
<span class="w"> </span><span class="na">mfa_delete</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="no">false</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="nb">website</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="na">index_document</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"index.html"</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
<div class="Note">
<p>There are other ways than S3 to host the HTML files (such as <a href="https://pages.github.com/">GitHub Pages</a>), and you do <em>not</em> have to use Terraform to do it, but I happen to have a <a href="https://github.com/brouberol/infrastructure/tree/master/terraform">Terraform codebase</a> for my personal infrastructure, which made it a no-brainer. If you decide to host the files another way, feel free to jump <a href="#github-secrets">ahead</a>.</p>
</div>
<p>I then created an AWS user, associated with an AWS access_key/secret_key pair and the following IAM policy, granting that user read/write permissions on that S3 bucket, and nothing else.</p>
<div class="highlight"><pre><span></span><code><span class="p">{</span>
<span class="w"> </span><span class="nt">"Version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2012-10-17"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"Statement"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"Sid"</span><span class="p">:</span><span class="w"> </span><span class="s2">"VisualEditor0"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"Effect"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Allow"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"Action"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="s2">"s3:PutObject"</span><span class="p">,</span>
<span class="w"> </span><span class="s2">"s3:GetObjectAcl"</span><span class="p">,</span>
<span class="w"> </span><span class="s2">"s3:GetObject"</span><span class="p">,</span>
<span class="w"> </span><span class="s2">"s3:ListBucket"</span><span class="p">,</span>
<span class="w"> </span><span class="s2">"s3:DeleteObject"</span><span class="p">,</span>
<span class="w"> </span><span class="s2">"s3:PutObjectAcl"</span>
<span class="w"> </span><span class="p">],</span>
<span class="w"> </span><span class="nt">"Resource"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="s2">"arn:aws:s3:::<my-bucket-name>"</span><span class="p">,</span>
<span class="w"> </span><span class="s2">"arn:aws:s3:::<my-bucket-name>/*"</span>
<span class="w"> </span><span class="p">]</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div>
<div id="github-secrets"></div>
<p>I then had to store the bucket name, keypair and AWS region name as encrypted secrets in the <code>bo</code> <a href="https://github.com/brouberol/bo">repository</a>, by going to <code>Settings > Secrets > Actions > New repository secret</code>.</p>
<p><img alt="Secrets" decoding="async" loading="lazy" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/rust-coverage/secrets.webp"></p>
<p>Once that was all set up, the project CI (Github Actions) needed to perform the <a href="https://github.com/brouberol/bo/blob/main/.github/workflows/tests.yml#L28-L77">following actions</a>:</p>
<ul>
<li>checking out the project and setting up a nightly Rust toolchain</li>
</ul>
<div class="highlight"><pre><span></span><code><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">uses</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">actions/checkout@v2</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Setup toolchain</span>
<span class="w"> </span><span class="nt">uses</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">actions-rs/toolchain@v1</span>
<span class="w"> </span><span class="nt">with</span><span class="p">:</span>
<span class="w"> </span><span class="nt">toolchain</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">nightly</span>
<span class="w"> </span><span class="nt">override</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span>
<span class="w"> </span><span class="nt">profile</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">minimal</span>
</code></pre></div>
<ul>
<li>running the unit tests with profiling and coverage collection enabled</li>
</ul>
<div class="highlight"><pre><span></span><code><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Run tests</span>
<span class="w"> </span><span class="nt">uses</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">actions-rs/cargo@v1</span>
<span class="w"> </span><span class="nt">with</span><span class="p">:</span>
<span class="w"> </span><span class="nt">command</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">test</span>
<span class="w"> </span><span class="nt">args</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--all-features --no-fail-fast</span><span class="w"> </span><span class="c1"># Customize args for your own needs</span>
<span class="w"> </span><span class="nt">env</span><span class="p">:</span>
<span class="w"> </span><span class="nt">CARGO_INCREMENTAL</span><span class="p">:</span><span class="w"> </span><span class="s">'0'</span>
<span class="w"> </span><span class="nt">RUSTFLAGS</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">|</span>
<span class="w"> </span><span class="no">-Zprofile -Ccodegen-units=1 -Cinline-threshold=0 -Clink-dead-code</span>
<span class="w"> </span><span class="no">-Coverflow-checks=off -Cpanic=abort -Zpanic_abort_tests -Cinstrument-coverage</span>
<span class="w"> </span><span class="nt">RUSTDOCFLAGS</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">|</span>
<span class="w"> </span><span class="no">-Zprofile -Ccodegen-units=1 -Cinline-threshold=0 -Clink-dead-code</span>
<span class="w">        </span><span class="no">-Coverflow-checks=off -Cpanic=abort -Zpanic_abort_tests -Cinstrument-coverage</span>
</code></pre></div>
<ul>
<li>generating the coverage report with <code>grcov</code>, via the <a href="https://github.com/actions-rs/grcov/"><code>actions-rs/grcov</code></a> action</li>
</ul>
<div class="highlight"><pre><span></span><code><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Gather coverage data</span>
<span class="w"> </span><span class="nt">id</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">coverage</span>
<span class="w"> </span><span class="nt">uses</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">actions-rs/grcov@v0.1</span>
</code></pre></div>
<ul>
<li>measuring the total coverage score, and reporting it in a commit status check if the job is associated with a pull request</li>
</ul>
<div class="highlight"><pre><span></span><code><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Report coverage in PR status for the current commit</span>
<span class="w"> </span><span class="nt">if</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">github.ref_name != 'main'</span>
<span class="w"> </span><span class="nt">run</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">|</span>
<span class="w"> </span><span class="no">set -eu</span>
<span class="w"> </span><span class="no">total=$(cat ${COV_REPORT_DIR}/badges/flat.svg | egrep '<title>coverage: ' | cut -d: -f 2 | cut -d% -f 1 | sed 's/ //g')</span>
<span class="w"> </span><span class="no">curl -s "https://brouberol:${GITHUB_TOKEN}@api.github.com/repos/brouberol/bo/statuses/${COMMIT_SHA}" -d "{\"state\": \"success\",\"target_url\": \"https://github.com/brouberol/bo/pull/${PULL_NUMBER}/checks?check_run_id=${RUN_ID}\",\"description\": \"${total}%\",\"context\": \"Measured coverage\"}"</span>
<span class="w"> </span><span class="nt">env</span><span class="p">:</span>
<span class="w"> </span><span class="nt">GITHUB_TOKEN</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">${{ secrets.GITHUB_TOKEN }}</span>
<span class="w"> </span><span class="nt">COMMIT_SHA</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">${{ github.event.pull_request.head.sha }}</span>
<span class="w"> </span><span class="nt">RUN_ID</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">${{ github.run_id }}</span>
<span class="w"> </span><span class="nt">PULL_NUMBER</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">${{ github.event.pull_request.number }}</span>
<span class="w"> </span><span class="nt">COV_REPORT_DIR</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">${{ steps.coverage.outputs.report }}</span>
</code></pre></div>
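<p>The extraction pipeline in the step above can be exercised locally: the report directory generated by <code>grcov</code> contains a <code>badges/flat.svg</code> shield whose <code>&lt;title&gt;</code> element holds the coverage percentage. Here is a small, self-contained sketch of that extraction; the sample SVG line is a made-up stand-in for the real badge file:</p>

```shell
# Stand-in for the <title> line found in grcov's badges/flat.svg
# (the exact markup here is an assumption, for illustration only)
svg_line='<title>coverage: 87.3%</title>'

# Same extraction as in the workflow step above: keep what follows
# "coverage: ", then drop the trailing "%" and any spaces
total=$(printf '%s\n' "$svg_line" \
  | grep -E '<title>coverage: ' \
  | cut -d: -f2 \
  | cut -d% -f1 \
  | sed 's/ //g')
echo "$total"  # prints: 87.3
```

<p>The workflow then posts that number as a commit status via the GitHub statuses API, which is what makes it show up as a check on the pull request.</p>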
<p><img alt="Secrets" decoding="async" loading="lazy" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/rust-coverage/cov3.webp"></p>
<ul>
<li>uploading the whole HTML coverage report to S3, using the <a href="https://github.com/jakejarvis/s3-sync-action">jakejarvis/s3-sync-action</a> action. We only do this for commits belonging to the <code>main</code> branch (<em>i.e.</em> direct pushes or after a pull request was merged).</li>
</ul>
<div class="highlight"><pre><span></span><code><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="s">"Upload</span><span class="nv"> </span><span class="s">the</span><span class="nv"> </span><span class="s">HTML</span><span class="nv"> </span><span class="s">coverage</span><span class="nv"> </span><span class="s">report</span><span class="nv"> </span><span class="s">to</span><span class="nv"> </span><span class="s">S3"</span>
<span class="w"> </span><span class="nt">if</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">github.ref_name == 'main'</span>
<span class="w"> </span><span class="nt">uses</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">jakejarvis/s3-sync-action@master</span>
<span class="w"> </span><span class="nt">with</span><span class="p">:</span>
<span class="w"> </span><span class="nt">args</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">--acl public-read --follow-symlinks --delete</span>
<span class="w"> </span><span class="nt">env</span><span class="p">:</span>
<span class="w"> </span><span class="nt">AWS_S3_BUCKET</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">${{ secrets.AWS_BUCKET }}</span>
<span class="w"> </span><span class="nt">AWS_ACCESS_KEY_ID</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">${{ secrets.AWS_ACCESS_KEY_ID }}</span>
<span class="w"> </span><span class="nt">AWS_SECRET_ACCESS_KEY</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">${{ secrets.AWS_SECRET_ACCESS_KEY }}</span>
<span class="w"> </span><span class="nt">AWS_REGION</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">${{ secrets.AWS_REGION }}</span>
<span class="w"> </span><span class="nt">SOURCE_DIR</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">${{ steps.coverage.outputs.report }}</span>
<span class="w"> </span><span class="nt">DEST_DIR</span><span class="p">:</span><span class="w"> </span><span class="s">'bo'</span>
</code></pre></div>
<p>With all of that set up, the coverage report is now <a href="http://github-brouberol-coverage.s3-website.eu-west-3.amazonaws.com/bo/main">publicly available</a>, refreshed every time a new commit hits <code>main</code>, and I even get a coverage shield for free! <img alt="coverage shield" decoding="async" loading="lazy" src="https://github-brouberol-coverage.s3.eu-west-3.amazonaws.com/bo/main/badges/flat.svg"></p>
<h1>Tools I'm thankful for (2022-02-22)</h1>
<p>Software engineers sometimes have a reputation for being overly critical when it comes to tools and programming languages. The web is full of rants, heated debates and articles about which technology is "better" and which is "crap". It was thus refreshing to read a post titled <a href="https://www.jowanza.com/blog/2022/2/21/software-im-thankful-for"><em>Software I'm thankful for</em></a>, which shone a positive light on some pieces of software. In honor of that article, I've decided to go through the same exercise.</p>
<h2 id="python">Python</h2>
<p><a href="https://python.org">Python</a> was my gateway to becoming a software engineer. It was the first programming language I <em>loved</em>, and I still do to this day.
I wrote Python code professionally for an AI startup, an e-ticketing startup, the Scottish government, a global hosting provider, and a huge observability SaaS company. I've written large Python webapps and quick Python scripts. I've written large asynchronous task workflows processing payments, trained machine learning models, written self-documented REST APIs, found my house listing by scraping the web, and I <a href="/river-monitoring-with-datadog">monitor the level of the river close by</a>, all of that in Python.</p>
<p>I also write Python code to maintain my own <a href="https://github.com/brouberol/infrastructure">infrastructure</a>, which I deploy via <a href="https://docs.ansible.com/">ansible</a>, itself written in Python. This blog is generated with <a href="https://pelican.readthedocs.org">Pelican</a>, which is written in Python. I've started to play with a Raspberry Pi Pico, which I program in... <a href="http://docs.circuitpython.org/en/latest/README.html">CircuitPython</a>. It's ubiquitous, and I've heard it called "the second best tool for every job", meaning that it probably won't be the most performant tool for what you're working on, but you'll make progress really fast.</p>
<p>Learning and programming Python has taught me many programming concepts, such as object-oriented programming, functional programming, unit testing, dataclasses, metaprogramming, REST APIs, HTTP, JSON, etc.</p>
<p>I now realize, however, that it also introduced me to <em>lower-level</em> concepts, such as ioctl, sockets, system calls and file descriptors, through the reassuring lens of the Python standard library, instead of having to interact with these concepts in C, which was much more intimidating (and still is today).</p>
<h2 id="docker">Docker</h2>
<p>The first time I was introduced to <a href="https://docs.docker.com/">Docker</a> was at a Python meetup in Lyon, circa 2013. After the 30-minute-long presentation, I still had no clue what any of it meant or why I'd ever need it, and pretty much shrugged it off. As the Docker ecosystem flourished and the dust settled, I started to understand the appeal.</p>
<p>Do you need to run redis to prototype against? Just run <code>docker run redis</code> and <em>voila</em>. Do you want to run <code>calibre-web</code> on your local VPS without having to install its dependencies in your system libraries? <a href="https://github.com/brouberol/infrastructure/blob/0e2ece50b45bc998cfc09dff1dc002c96f91cdee/playbooks/roles/gallifrey/calibre/tasks/main.yml#L10-L26">Sure</a>.</p>
<p>Docker allowed me to self-host a collection of tools that I use every day, package and run applications in extremely large production environments, and spin up development environments without polluting my system libraries. It boosted my productivity and became part of my day-to-day workflow. Yet none of these are the <em>real</em> reason why I'm thankful for Docker.</p>
<p>I've seen many companies break down their monolith into dockerized microservices. The commonly invoked reasons are allowing teams to choose their own language for each project, and helping horizontally scale some load-critical apps. As useful as Docker was to start a single container, it didn't solve the issue of starting several containers that could communicate with each other on a single host. Enter <a href="https://docs.docker.com/compose/">docker-compose</a>, which in turn didn't solve the issue of orchestrating containers across a fleet of nodes. Enter <a href="https://mesosphere.github.io/marathon/">Mesos/Marathon</a>, <a href="https://docs.docker.com/engine/swarm/">Docker Swarm</a>, <a href="https://kubernetes.io">Kubernetes</a>, <a href="https://aws.amazon.com/fr/ecs/">Amazon ECS</a>, etc.</p>
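<p>For illustration, the kind of single-host wiring that <code>docker-compose</code> made trivial can be sketched in a few lines of a hypothetical <code>docker-compose.yml</code>; the service names and image tags here are assumptions:</p>

```yaml
# A hypothetical app container and the redis instance it talks to.
# Within the compose network, "app" can reach redis at hostname "redis".
services:
  app:
    image: my-app:latest
    depends_on:
      - redis
    environment:
      REDIS_URL: redis://redis:6379
  redis:
    image: redis:7
```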
<p>The beef I have with Docker is that the hype around its <em>ecosystem</em> caused small companies to onboard an immense amount of complexity from the absolute get-go, to help with recruiting. Because engineers want to build experience with Kubernetes, these companies find themselves dividing their attention between grappling with its inherent complexity (distributed tracing, image recycling policies, RBAC, etc) and building their actual core value.</p>
<p>This is also why I'm thankful for Docker and its ecosystem: I believe I've seen situations in which it truly was critically useful, and I'll now be able to differentiate between situations in which we need it, and situations in which we only wish we did.</p>
<h2 id="raspberry-pi">Raspberry Pi</h2>
<p>Before I joined OVH, the <em>only</em> sysadmin experience I had was tinkering with my <a href="https://www.raspberrypi.com/products/raspberry-pi-4-model-b/">Raspberry Pi</a>. Thanks to that $35, matchbox-sized computer, I got to learn iptables and systemd, port forwarding, ssh hardening, file system checks and repairs. But really, the crucial point is that I was able to learn all of that by making mistakes. I'd rather learn why you need to be careful with <code>iptables -j DROP</code> in the comfort of my own home than in a high-pressure production environment. I can't overstate the impact that learning without the fear of public failure had on me.</p>
<p>I'm now getting into electronics through the <a href="https://www.raspberrypi.com/products/raspberry-pi-pico/">Raspberry Pi Pico</a>, which opens a whole new exploration and tinkering domain for me!</p>
<h2 id="the-terminal">The terminal</h2>
<p>The terminal is a truly important part of my day as a software engineer. It's really what allows me to feel in control. Like Python, it became a familiar tool through which I could discover entirely new domains, and interact with new systems and concepts. I learned so much from it that I decided to help people <a href="/category/essential-tools-and-practices-for-the-aspiring-software-developer">get familiar with the terminal and the shell</a>.</p>
<h1>Sending a webhook from Synology DSM to Discord (2022-01-17)</h1>
<p>As running a Datadog agent on a Synology Play NAS is not obvious, I wanted to enable Discord webhook push notifications (as this is where my Datadog alerts are already being sent). This way, I'd get plenty of alerts "for free" without having to configure new Datadog monitors.</p>
<p>While sending webhook notifications from a Synology NAS to Discord is technically possible, the DSM UI seems to prevent us from doing so, as documented in this <a href="https://www.synoforum.com/threads/webhooks-to-post-alerts-messages-on-to-discord.6725/#post-32618">forum thread</a>: we <em>have</em> to include a <code>hello world</code> message as part of the notification content, without which the UI won't allow us to save the webhook configuration.</p>
<p>You can however circumvent the issue by <code>ssh</code>-ing into the NAS and editing <code>/usr/syno/etc/synowebhook.conf</code> into this:</p>
<div class="highlight"><pre><span></span><code><span class="p">{</span>
<span class="w"> </span><span class="nt">"Discord"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"needssl"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"port"</span><span class="p">:</span><span class="w"> </span><span class="mi">8090</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"prefix"</span><span class="p">:</span><span class="w"> </span><span class="s2">"A new system event occurred on your %HOSTNAME%"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"req_header"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"req_method"</span><span class="p">:</span><span class="w"> </span><span class="s2">"post"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"req_param"</span><span class="p">:</span><span class="w"> </span><span class="s2">"{\"username\":\"Synology\", \"avatar_url\": \"https://play-lh.googleusercontent.com/HjbYWbXJ-6e6Cia-mBbHDSdontW1yE6MHMaXqlHW80CQegDOEPQ1HGACxvEpnqCUHgo\", \"embeds\": [{\"description\": \"@@TEXT@@\", \"title\": \"@@PREFIX@@\"}]}"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"sepchar"</span><span class="p">:</span><span class="w"> </span><span class="s2">" "</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"template"</span><span class="p">:</span><span class="w"> </span><span class="s2">"$webhook_url"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"custom"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"url"</span><span class="p">:</span><span class="w"> </span><span class="s2">"$webhook_url"</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
<div class="Note">
<p><b>Note</b>: replace <code>$webhook_url</code> by your Discord webhook URL.</p>
</div>
<p>When this is done, you should see a <code>Discord</code> webhook in your Webhook Push Services, and you should now be able to send a test message to Discord!</p>
<p><picture>
<source srcset="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/syno-discord/dark/discord-notif.webp" media="(prefers-color-scheme: dark)">
<img alt="" decoding="async" loading="lazy" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/syno-discord/light/discord-notif.webp">
</picture></p>
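<p>Under the hood, DSM fills the <code>@@TEXT@@</code> and <code>@@PREFIX@@</code> placeholders of <code>req_param</code> before POSTing the result to the webhook URL. You can reproduce that substitution by hand to craft a test payload; a sketch, in which the event text is made up:</p>

```shell
# Template mirroring the "embeds" part of the req_param field above
template='{"username": "Synology", "embeds": [{"description": "@@TEXT@@", "title": "@@PREFIX@@"}]}'

# Substitute the placeholders the way DSM would for a real event
payload=$(printf '%s' "$template" \
  | sed -e 's/@@TEXT@@/Volume 1 is running out of space/' \
        -e 's/@@PREFIX@@/A new system event occurred on your NAS/')
echo "$payload"

# To actually deliver it, POST it to your Discord webhook URL:
# curl -H 'Content-Type: application/json' -d "$payload" "$webhook_url"
```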
<p>Now, any warning or alert generated from DSM will automatically be sent to Discord as well!</p>
<h1>Metaprocrastinating on writing a book by writing a text editor (2021-09-04)</h1>
<p>If you have been following my <a href="https://blog.balthazar-rouberol.com/category/essential-tools-and-practices-for-the-aspiring-software-developer">Essential Tools and Practices for the Aspiring Software Developer</a> posts and were eager to read more, you might have noticed that they stopped coming after a while. I have a draft for the last chapter, and I regularly think about getting back to it, at least to get some closure. Alas, procrastination being what it is, I never did.</p>
<p>My procrastination level became really interesting when I convinced myself that one of the reasons I didn't want to write that final chapter was that my text editor was standing in the way. I was either using a full-fledged code editor (Sublime Text/VSCode) riddled with complex features I didn't need (autocompletion, linting, etc) or getting lost in configuring <code>vim</code> into the perfect markdown editor. Either way, these were the wrong tools for the job, and my only way to get back to writing was to... write my own?</p>
<p>And thus, <a href="https://github.com/brouberol/bo"><code>bo</code></a> was born.</p>
<video controls>
<source src="https://user-images.githubusercontent.com/480131/131999617-61acc5a2-4055-4cd1-9da1-134ee9e075b4.mp4" type="video/mp4">
</video>
<p>The idea was to create a simple text editor with powerful <code>vim</code>-like navigation: a very simple writing interface, combined with the ability to move through the text in a couple of keystrokes, leveraging the muscle memory I built over the years using <code>vim</code> (or the vim mode of various editors).</p>
<p>I wanted it to be written in Rust, as it would be a good opportunity for me to write non-trivial code in a safe language, and also because, well, it just sounded fun.</p>
<p>I've been working on it on and off in the last month, and I've implemented enough features so that it's starting to feel comfortable.</p>
<p><img alt="bo-help" decoding="async" loading="lazy" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/metaprocrastination-bo/bo-help.webp"></p>
<p>There's still <a href="https://github.com/brouberol/bo/issues">a lot to do</a>! I'd be delighted if you wanted to test it and give it a go!</p>
<p><em>Written with <code>bo</code>.</em></p>
<h1>How to setup a personal wireguard VPN (2019-12-11)</h1>
<p>This article provides guidance on how to set up a Wireguard VPN between a server and your phone, allowing you to avoid being snooped on while you travel.</p>
<p>My work takes me to the United States multiple times a year, and I've never been comfortable using the hotel Wi-Fi, or even my company VPN for that matter, when I'm there. I want to be assured that what I do online is my business and my business alone.</p>
<p>I had heard about <a href="https://wireguard.com">Wireguard</a> multiple times, and how performant and simple it was compared to OpenVPN (I'd like to have a talk with whoever came up with the OpenVPN config file...). I decided to jump in and give it a try. The idea was to set up a VPN access point on my VPS, hosted in Paris, to which I could connect when I travel.</p>
<hr>
<h2 id="installing-wireguard">Installing wireguard</h2>
<p>I followed Wireguard's <a href="https://www.wireguard.com/install">official install instructions</a>. However, I also needed to install the header files for the kernel I was running, so that <code>dkms</code> could compile the <code>wireguard</code> kernel module.</p>
<div class="highlight"><pre><span></span><code><span class="gp">% </span>apt-get<span class="w"> </span>install<span class="w"> </span>linux-headers-<span class="k">$(</span>uname<span class="w"> </span>-r<span class="k">)</span>
<span class="gp">% </span>add-apt-repository<span class="w"> </span>ppa:wireguard/wireguard
<span class="gp">% </span>apt-get<span class="w"> </span>update
<span class="gp">% </span>apt-get<span class="w"> </span>install<span class="w"> </span>wireguard
</code></pre></div>
<p>If everything is going according to plan, you should see the <code>wireguard</code> kernel module being compiled by <code>dkms</code> at install time:</p>
<div class="highlight"><pre><span></span><code><span class="go">...</span>
<span class="go">DKMS: build completed.</span>
<span class="go">wireguard.ko:</span>
<span class="go">Running module version sanity check.</span>
<span class="go"> - Original module</span>
<span class="go"> - No original module exists within this kernel</span>
<span class="go"> - Installation</span>
<span class="go"> - Installing to /lib/modules/X.Y.Z-ABC-generic/updates/dkms/</span>
<span class="go">...</span>
</code></pre></div>
<p>At that point, you should be able to load the module and see it in the <code>lsmod</code> output.</p>
<div class="highlight"><pre><span></span><code><span class="gp">% </span>modprobe<span class="w"> </span>wireguard
<span class="gp">% </span>lsmod<span class="w"> </span><span class="p">|</span><span class="w"> </span>grep<span class="w"> </span>wireguard
<span class="go">wireguard 204800 0</span>
<span class="go">ip6_udp_tunnel 16384 1 wireguard</span>
<span class="go">udp_tunnel 16384 1 wireguard</span>
</code></pre></div>
<h2 id="configuring-the-server-peer">Configuring the server peer</h2>
<p>First off, we create the server <code>wireguard</code> peer's public and private keys.</p>
<div class="highlight"><pre><span></span><code><span class="gp">% </span><span class="nb">cd</span><span class="w"> </span>/etc/wireguard
<span class="gp">% </span><span class="nb">umask</span><span class="w"> </span><span class="m">077</span><span class="w"> </span><span class="c1"># disable public access</span>
<span class="gp">% </span>wg<span class="w"> </span>genkey<span class="w"> </span><span class="p">|</span><span class="w"> </span>tee<span class="w"> </span>privatekey<span class="w"> </span><span class="p">|</span><span class="w"> </span>wg<span class="w"> </span>pubkey<span class="w"> </span>><span class="w"> </span>publickey
</code></pre></div>
<p>We now configure the server peer, assuming that the VPS public network interface is <code>ens2</code>. We'll use the <code>192.168.2.0/24</code> subnet for all <code>wireguard</code>-related addresses, and assign the <code>192.168.2.1</code> IP to the server peer.</p>
<div class="highlight"><pre><span></span><code><span class="gp">% </span>cat<span class="w"> </span><<EOF<span class="w"> </span>><span class="w"> </span>/etc/wireguard/wg0.conf
<span class="go">[Interface]</span>
<span class="gp"># </span>The<span class="w"> </span>IP<span class="w"> </span>assigned<span class="w"> </span>to<span class="w"> </span>the<span class="w"> </span>wg0<span class="w"> </span>interface
<span class="go">Address = 192.168.2.1/24</span>
<span class="gp"># </span>The<span class="w"> </span>port<span class="w"> </span>wireguard<span class="w"> </span>will<span class="w"> </span>listen<span class="w"> </span>on
<span class="go">ListenPort = <public port></span>
<span class="gp"># </span>The<span class="w"> </span>private<span class="w"> </span>key<span class="w"> </span>used<span class="w"> </span>by<span class="w"> </span>the<span class="w"> </span><span class="nb">local</span><span class="w"> </span>peer
<span class="go">PrivateKey = $(cat /etc/wireguard/privatekey)</span>
<span class="gp"># </span>Accept<span class="w"> </span>traffic<span class="w"> </span>to<span class="w"> </span>the<span class="w"> </span>wg0<span class="w"> </span>interface<span class="w"> </span>and<span class="w"> </span>allow<span class="w"> </span>NATing<span class="w"> </span>traffic<span class="w"> </span>from<span class="w"> </span>ens2<span class="w"> </span>to<span class="w"> </span>wg0
<span class="go">PostUp = iptables -A FORWARD -i %i -j ACCEPT; iptables -A FORWARD -o %i -j ACCEPT; iptables -t nat -A POSTROUTING -o ens2 -j MASQUERADE</span>
<span class="go">PostDown = iptables -D FORWARD -i %i -j ACCEPT; iptables -D FORWARD -o %i -j ACCEPT; iptables -t nat -D POSTROUTING -o ens2 -j MASQUERADE</span>
<span class="go">EOF</span>
<span class="gp">% </span>rm<span class="w"> </span>/etc/wireguard/privatekey
</code></pre></div>
<p>We also need to authorize UDP traffic on the <code><public port></code> port.</p>
<div class="highlight"><pre><span></span><code><span class="gp">% </span>iptables<span class="w"> </span>-i<span class="w"> </span>ens2<span class="w"> </span>-p<span class="w"> </span>udp<span class="w"> </span>--dport<span class="w"> </span><public<span class="w"> </span>port><span class="w"> </span>-j<span class="w"> </span>ACCEPT
</code></pre></div>
<p>Once that's done, we can use <code>wg-quick</code> to set up the <code>wg0</code> network interface, as well as the <code>MASQUERADE</code> iptables rules that will NAT traffic between the public <code>ens2</code> interface and <code>wg0</code>. We use systemd for this, which ensures that the <code>wg0</code> interface is re-created after a reboot.</p>
<div class="highlight"><pre><span></span><code><span class="gp">% </span>systemctl<span class="w"> </span>start<span class="w"> </span>wg-quick@wg0
<span class="go">[#] ip link add wg0 type wireguard</span>
<span class="go">[#] wg setconf wg0 /dev/fd/63</span>
<span class="go">[#] ip -4 address add 192.168.2.1/24 dev wg0</span>
<span class="go">[#] ip link set mtu 1420 up dev wg0</span>
<span class="go">[#] iptables -A FORWARD -i wg0 -j ACCEPT; iptables -A FORWARD -o wg0 -j ACCEPT; iptables -t nat -A POSTROUTING -o ens2 -j MASQUERADE</span>
<span class="gp">% </span>systemctl<span class="w"> </span><span class="nb">enable</span><span class="w"> </span>wg-quick@wg0
<span class="go">Created symlink from /etc/systemd/system/multi-user.target.wants/wg-quick@wg0.service to /lib/systemd/system/wg-quick@.service.</span>
</code></pre></div>
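<p>One prerequisite the <code>MASQUERADE</code> setup relies on, and which isn't shown above: the kernel must be allowed to forward IPv4 packets, otherwise traffic arriving on <code>wg0</code> will never leave through <code>ens2</code>. Forwarding is disabled by default on many distributions, so it's worth double-checking (the <code>99-wireguard.conf</code> file name below is just a suggestion):</p>

```shell
# Print the current forwarding setting: 1 means enabled, 0 means disabled
cat /proc/sys/net/ipv4/ip_forward

# If it prints 0, enable forwarding for the running kernel (as root):
#   sysctl -w net.ipv4.ip_forward=1
# and persist the setting across reboots:
#   echo 'net.ipv4.ip_forward = 1' > /etc/sysctl.d/99-wireguard.conf
```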
<h2 id="configuring-the-phone-peer">Configuring the phone peer</h2>
<p>I use the Wireguard <a href="https://play.google.com/store/apps/details?id=com.wireguard.android">Android app</a> to assign the <code>192.168.2.2/32</code> address to my phone and to add the server peer details (as Wireguard is a point-to-point VPN, without a client/server architecture).</p>
<p>The server peer public key is set to the content of the remote <code>/etc/wireguard/publickey</code> file on my VPS. As I want to route all my phone traffic through <code>wireguard</code>, I set the <code>Allowed IPs</code> field to <code>0.0.0.0/0</code>, and the peer endpoint to <code><server public ens2 IP>:<public port></code>.</p>
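<p>For reference, the settings entered in the app correspond to a plain-text configuration along these lines. This is only a sketch: all bracketed values are placeholders to fill in from your own setup.</p>

```ini
[Interface]
# The phone peer address within the wireguard subnet
Address = 192.168.2.2/32
PrivateKey = <phone private key generated in app>

[Peer]
# The server peer, identified by its public key
PublicKey = <content of the server's /etc/wireguard/publickey>
# Route all phone traffic through the tunnel
AllowedIPs = 0.0.0.0/0
Endpoint = <server public ens2 IP>:<public port>
```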
<p><img alt="screenshot" decoding="async" loading="lazy" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/wireguard/android-wg.jpg"></p>
<h2 id="authorizing-the-phone-peer">Authorizing the phone peer</h2>
<p>Once the app has generated a key pair for the phone peer, we also need to authorize its public key on the server peer and restart <code>wireguard</code>.</p>
<div class="highlight"><pre><span></span><code><span class="gp">% </span>cat<span class="w"> </span><<EOF<span class="w"> </span>>><span class="w"> </span>/etc/wireguard/wg0.conf
<span class="go">[Peer]</span>
<span class="gp"># </span>Phone<span class="w"> </span>peer
<span class="go">PublicKey = <phone peer public key generated in app></span>
<span class="go">AllowedIPs = 192.168.2.2/32</span>
<span class="go">EOF</span>
<span class="gp">% </span>systemctl<span class="w"> </span>restart<span class="w"> </span>wg-quick@wg0
</code></pre></div>
<h2 id="testing-the-whole-thing">Testing the whole thing</h2>
<p>With my phone disconnected from the server <code>wireguard</code> peer, I can inspect the state of the <code>wg0</code> server network interface:</p>
<div class="highlight"><pre><span></span><code><span class="gp">% </span>ifconfig<span class="w"> </span>wg0
<span class="go">wg0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00</span>
<span class="go"> inet addr:192.168.2.1 P-t-P:192.168.2.1 Mask:255.255.255.0</span>
<span class="go"> UP POINTOPOINT RUNNING NOARP MTU:1420 Metric:1</span>
<span class="go"> RX packets:0 errors:0 dropped:0 overruns:0 frame:0</span>
<span class="go"> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0</span>
<span class="go"> collisions:0 txqueuelen:1</span>
<span class="go"> RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)</span>
</code></pre></div>
<p>I then connect my phone to the server peer, open a random webpage, and <em>voilà</em>: traffic is now flowing through the server <code>wg0</code> interface.</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>ifconfig<span class="w"> </span>wg0
<span class="go">wg0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00</span>
<span class="go"> inet addr:192.168.2.1 P-t-P:192.168.2.1 Mask:255.255.255.0</span>
<span class="go"> UP POINTOPOINT RUNNING NOARP MTU:1420 Metric:1</span>
<span class="go"> RX packets:4084 errors:0 dropped:132 overruns:0 frame:0</span>
<span class="go"> TX packets:4895 errors:0 dropped:0 overruns:0 carrier:0</span>
<span class="go"> collisions:0 txqueuelen:1</span>
<span class="go"> RX bytes:452436 (452.4 KB) TX bytes:2954188 (2.9 MB)</span>
</code></pre></div>
<p>A quick <code>tcpdump</code> shows that the data flowing through <code>wg0</code> is indeed encrypted.</p>
<div class="highlight"><pre><span></span><code><span class="gp">$ </span>tcpdump<span class="w"> </span>-i<span class="w"> </span>wg0<span class="w"> </span>-vv<span class="w"> </span>-c<span class="w"> </span><span class="m">100</span><span class="w"> </span>-X
<span class="go">tcpdump: listening on wg0, link-type RAW (Raw IP), capture size 262144 bytes</span>
<span class="go">15:14:05.096356 IP (tos 0x0, ttl 105, id 47301, offset 0, flags [none], proto TCP (6), length 332)</span>
<span class="go"> wq-in-f188.1e100.net.5228 > 192.168.2.2.46641: Flags [P.], cksum 0x8308 (correct), seq 1867855144:1867855424, ack 229885280, win 253, options [nop,nop,TS val 426814177 ecr 2017832], length 280</span>
<span class="go"> 0x0000: 4500 014c b8c5 0000 6906 fe02 4a7d 8cbc E..L....i...J}..</span>
<span class="go"> 0x0010: c0a8 0202 146c b631 6f55 3528 0db3 c560 .....l.1oU5(...`</span>
<span class="go"> 0x0020: 8018 00fd 8308 0000 0101 080a 1970 aae1 .............p..</span>
<span class="go"> 0x0030: 001e ca28 1703 0301 13e7 c1f4 5089 ed04 ...(........P...</span>
<span class="go"> 0x0040: aba6 ef67 2cbe a7b3 f0cc 02d0 caaa d675 ...g,..........u</span>
<span class="go">...</span>
</code></pre></div>
<p>I now have a personal VPN I can use whenever I travel abroad.</p>
<hr>
<p>Thanks to Thomas for being patient with me while answering networking questions at 11pm, and for proof-reading this article. Any remaining mistake is my own.</p>Managing my infra like it's 20192019-07-22T00:00:00+02:002019-07-22T00:00:00+02:00Balthazar Rouberoltag:blog.balthazar-rouberol.com,2019-07-22:/managing-my-infra-like-its-2019<p>I recently realized that I was routinely managing thousands of servers and petabytes of data in my daily job, but was still managing my own personal infrastructure like I was living in 1999.</p>
<p><img alt="my-infra" decoding="async" loading="lazy" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/managing-infra/infra.png"></p>
<hr>
<p>With the advent of configuration management tools such as <a href="https://docs.ansible.com/">Ansible</a>, <a href="https://www.chef.io/">Chef</a>, and the like, it became easier …</p><p>I recently realized that I was routinely managing thousands of servers and petabytes of data in my daily job, but was still managing my own personal infrastructure like I was living in 1999.</p>
<p><img alt="my-infra" decoding="async" loading="lazy" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/managing-infra/infra.png"></p>
<hr>
<p>With the advent of configuration management tools such as <a href="https://docs.ansible.com/">Ansible</a>, <a href="https://www.chef.io/">Chef</a>, and the like, it became easier to configure instances in a reproducible manner by defining said configuration as code. <a href="http://terraform.io/">Terraform</a> made it easier to codify and provision cloud resources: instances, but also security groups, permissions, storage, load balancers, etc.</p>
<p>It's easy to simply think of a cloud infrastructure as a pool of compute resources. It is however often so much more than that. When executed right, The Cloud is a set of meshed services, interacting and communicating with each other (possibly with compute resources sitting in the middle). That applies to vast and complex infrastructures such as the one I work on at <a href="https://datadoghq.com">Datadog</a>, but it also applies to my ridiculously tiny personal one. Realizing this got me thinking. Why wasn't I using the same tools and techniques to manage my small infrastructure as the ones I use daily?</p>
<h2 id="my-infrastructure">My infrastructure</h2>
<p>My personal infrastructure consists of (drumrolls...) 3 servers:</p>
<ul>
<li>a VPS running in Scaleway, hosting my personal services (personal website, blog, git repositories, <a href="https://radicale.org/documentation/">CalDAV server</a>, <a href="https://usefathom.com/">traffic analytics</a>, <a href="https://thelounge.chat/">IRC client</a>, <a href="https://www.wallabag.org/en">Read-it-later service</a>, etc)</li>
<li>a VPS running in OVH, hosting my mother's website</li>
<li>a Raspberry Pi, running in my living room, hosting private services (<a href="https://kresus.org/en/index.html">Kresus</a>)</li>
</ul>
<p>Until now, each of these servers was managed in an <em>ad-hoc</em> fashion, sometimes with scripts, sometimes without. All the cloud resources on which my services depend (S3 buckets, DNS zones, etc.) were managed manually, using the cloud provider web console.</p>
<p>I manage my DNS zones with OVH, I use the AWS S3 bucket free tier for the blog images, and Datadog for monitoring.</p>
<p><img alt="ssl-expiry-monitoring" decoding="async" loading="lazy" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/managing-infra/datadog-monitors.png"></p>
<h2 id="improving-the-setup">Improving the setup</h2>
<p>I had several objectives in mind to improve the current setup:</p>
<ul>
<li>define all instances configuration and state in <a href="https://docs.ansible.com/ansible/latest/user_guide/playbooks.html">ansible playbooks</a></li>
<li>re-use and share instances configuration by leveraging <a href="https://docs.ansible.com/ansible/latest/user_guide/playbooks_reuse_roles.html">ansible roles</a></li>
<li>define and manage all cloud resources using <a href="https://terraform.io">terraform</a> to never have to log into a cloud web console again</li>
<li>secure all web-services with an automatically renewed SSL certificate provided by Let's Encrypt</li>
<li>run all services behind a reverse-proxy, using a docker container or a <a href="https://www.brendanlong.com/systemd-user-services-are-amazing.html">userland systemd service</a> with minimal permissions and privileges</li>
<li>monitor the hosts and services using <a href="https://datadoghq.com">Datadog</a> (free for 5 hosts or less), with monitors defined in terraform</li>
<li>secure the SSH connections of the internet-facing hosts via <a href="https://duo.com/">Duo</a> (free for 10 users or less)</li>
<li>be able to SSH into all hosts from my personal and work laptop, as well as from my <a href="https://play.google.com/store/apps/details?id=org.connectbot&hl=en_US">phone</a></li>
<li>monitor my daily backups</li>
</ul>
<h2 id="show-me-the-code">Show me the code</h2>
<p>You can have a look at the code <a href="https://github.com/brouberol/infrastructure">here</a>. I've purposefully omitted the <code>terraform/global_vars/main.tf</code> file, credentials are obviously encrypted, and API keys are defined in my home directory, but everything else is openly readable. My hope is that readers might either learn something or point out where I'm doing something silly or insecure.</p>
<h2 id="what-now">What now?</h2>
<p>I'm now confident that I can open some of these services to friends, if they want to. I measure and monitor my own SLIs, the expiry of the SSL certificates, and can intervene from anywhere if something breaks.</p>
<p><img alt="ssl-expiry-monitoring" decoding="async" loading="lazy" src="https://balthazar-rouberol-blog.s3.eu-west-3.amazonaws.com/managing-infra/ssl-expiry-monitoring.png"></p>
<p>My infrastructure is now more secure, and has been audited by fellow peers <sup id="fnref:review"><a class="footnote-ref" href="#fn:review">1</a></sup>. I'm now confident I can restore the services in the face of an instance loss (which is very important for my mother, as her website has a fair amount of traffic and brings her regular new customers).</p>
<p>I'm also dogfooding Datadog features, which got me to suggest a couple of improvements to the Datadog <a href="https://www.terraform.io/docs/providers/datadog/index.html">terraform provider</a>, which will be worked on next quarter.</p>
<div class="footnote">
<hr>
<ol>
<li id="fn:review">
<p>Thanks to Mehdi and Thomas for the thorough playbook review. Any remaining mistake or silliness is my own. <a class="footnote-backref" href="#fnref:review" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>Allocating unbounded resources to a kubernetes pod2018-09-29T00:00:00+02:002018-09-29T00:00:00+02:00Balthazar Rouberoltag:blog.balthazar-rouberol.com,2018-09-29:/allocating-unbounded-resources-to-a-kubernetes-pod<p>Note: this article assumes that the reader is familiar with <a href="https://kubernetes.io">Kubernetes</a> and Linux <a href="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/ch01">cgroups</a>.</p>
<hr>
<p>When deploying a pod in a Kubernetes cluster, you normally have 2 choices when it comes to <a href="https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container">resources</a> allotment:</p>
<ul>
<li>defining CPU/memory resource requests and limits <a href="https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container">at the pod level</a></li>
<li>defining default CPU/memory requests and …</li></ul><p>Note: this article assumes that the reader is familiar with <a href="https://kubernetes.io">Kubernetes</a> and Linux <a href="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/ch01">cgroups</a>.</p>
<hr>
<p>When deploying a pod in a Kubernetes cluster, you normally have 2 choices when it comes to <a href="https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container">resources</a> allotment:</p>
<ul>
<li>defining CPU/memory resource requests and limits <a href="https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container">at the pod level</a></li>
<li>defining default CPU/memory requests and limits at the <a href="https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/memory-default-namespace/">namespace level</a> using a <code>LimitRange</code></li>
</ul>
<p>However, what if circumstances allowed you to allocate unbounded resources to your pod? While that would go against the idea of bin-packing pods by using resource-bounded cgroups, it could still be useful if you ran no other pods than the unbounded one on your node. In that case, you wouldn't need to protect your pod against any noisy neighbour, and you'd want it to be able to use all the available node resources.</p>
<p>This (while not strictly documented) can be accomplished by using the following resource limits and requests:</p>
<div class="highlight"><pre><span></span><code><span class="n">resources</span><span class="o">:</span>
<span class="w"> </span><span class="n">limits</span><span class="o">:</span>
<span class="w"> </span><span class="n">cpu</span><span class="o">:</span><span class="w"> </span><span class="mi">0</span>
<span class="w"> </span><span class="n">memory</span><span class="o">:</span><span class="w"> </span><span class="mi">0</span>
<span class="w"> </span><span class="n">requests</span><span class="o">:</span>
<span class="w"> </span><span class="n">cpu</span><span class="o">:</span><span class="w"> </span><span class="mi">0</span>
<span class="w"> </span><span class="n">memory</span><span class="o">:</span><span class="w"> </span><span class="mi">0</span>
</code></pre></div>
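<p>For context, here is where that snippet sits in a complete manifest. This is a minimal hypothetical pod spec (the container name and image are made-up placeholders), using the same <code>my-pod</code> name as the <code>kubectl</code> examples:</p>

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod        # name reused in the kubectl examples
spec:
  containers:
  - name: app
    image: nginx      # placeholder image
    resources:
      limits:
        cpu: 0
        memory: 0
      requests:
        cpu: 0
        memory: 0
```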
<p>In our case, we also have a defined <code>LimitRange</code> in our namespace, so we want to make sure that our request for unbounded resources does not get overridden by the default values.</p>
<div class="highlight"><pre><span></span><code><span class="err">$</span><span class="w"> </span><span class="n">kubectl</span><span class="w"> </span><span class="n">describe</span><span class="w"> </span><span class="n">limitrange</span><span class="w"> </span><span class="n">my</span><span class="o">-</span><span class="n">limit</span><span class="o">-</span><span class="n">range</span>
<span class="n">Name</span><span class="p">:</span><span class="w"> </span><span class="n">my</span><span class="o">-</span><span class="n">limit</span><span class="o">-</span><span class="n">range</span>
<span class="n">Namespace</span><span class="p">:</span><span class="w"> </span><span class="k">default</span>
<span class="n">Type</span><span class="w"> </span><span class="n">Resource</span><span class="w"> </span><span class="n">Min</span><span class="w"> </span><span class="n">Max</span><span class="w"> </span><span class="k">Default</span><span class="w"> </span><span class="n">Request</span><span class="w"> </span><span class="k">Default</span><span class="w"> </span><span class="n">Limit</span>
<span class="o">----</span><span class="w"> </span><span class="o">--------</span><span class="w"> </span><span class="o">---</span><span class="w"> </span><span class="o">---</span><span class="w"> </span><span class="o">---------------</span><span class="w"> </span><span class="o">-------------</span>
<span class="n">Container</span><span class="w"> </span><span class="n">memory</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">512</span><span class="n">Mi</span><span class="w"> </span><span class="mi">1</span><span class="n">Gi</span>
<span class="n">Container</span><span class="w"> </span><span class="n">cpu</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">500</span><span class="n">m</span><span class="w"> </span><span class="mi">1</span>
<span class="err">$</span><span class="w"> </span><span class="n">kubectl</span><span class="w"> </span><span class="k">get</span><span class="w"> </span><span class="n">pod</span><span class="w"> </span><span class="n">my</span><span class="o">-</span><span class="n">pod</span><span class="w"> </span><span class="o">-</span><span class="n">o</span><span class="w"> </span><span class="n">jsonpath</span><span class="o">=</span><span class="c">'{.spec.containers[0].resources}'</span>
<span class="n">map</span><span class="o">[</span><span class="n">limits</span><span class="p">:</span><span class="n">map</span><span class="o">[</span><span class="n">cpu</span><span class="p">:</span><span class="mi">0</span><span class="w"> </span><span class="n">memory</span><span class="p">:</span><span class="mi">0</span><span class="o">]</span><span class="w"> </span><span class="n">requests</span><span class="p">:</span><span class="n">map</span><span class="o">[</span><span class="n">cpu</span><span class="p">:</span><span class="mi">0</span><span class="w"> </span><span class="n">memory</span><span class="p">:</span><span class="mi">0</span><span class="o">]]</span>
</code></pre></div>
<p>It seems that the <code>LimitRange</code> has not overridden our request. However, we see a different picture when we inspect the node running our pod:</p>
<div class="highlight"><pre><span></span><code><span class="err">$</span><span class="w"> </span><span class="n">kubectl</span><span class="w"> </span><span class="k">get</span><span class="w"> </span><span class="n">pod</span><span class="w"> </span><span class="n">my</span><span class="o">-</span><span class="n">pod</span><span class="w"> </span><span class="o">-</span><span class="n">o</span><span class="w"> </span><span class="n">jsonpath</span><span class="o">=</span><span class="c">'{.spec.nodeName}'</span>
<span class="n">my</span><span class="o">-</span><span class="n">node</span>
<span class="err">$</span><span class="w"> </span><span class="n">kubectl</span><span class="w"> </span><span class="n">describe</span><span class="w"> </span><span class="n">node</span><span class="w"> </span><span class="n">my</span><span class="o">-</span><span class="n">node</span>
<span class="p">...</span>
<span class="n">Non</span><span class="o">-</span><span class="n">terminated</span><span class="w"> </span><span class="n">Pods</span><span class="p">:</span><span class="w"> </span><span class="p">(</span><span class="mi">6</span><span class="w"> </span><span class="ow">in</span><span class="w"> </span><span class="n">total</span><span class="p">)</span>
<span class="w"> </span><span class="k">Namespace</span><span class="w"> </span><span class="nn">Name</span><span class="w"> </span><span class="n">CPU</span><span class="w"> </span><span class="n">Requests</span><span class="w"> </span><span class="n">CPU</span><span class="w"> </span><span class="n">Limits</span><span class="w"> </span><span class="n">Memory</span><span class="w"> </span><span class="n">Requests</span><span class="w"> </span><span class="n">Memory</span><span class="w"> </span><span class="n">Limits</span>
<span class="w"> </span><span class="o">---------</span><span class="w"> </span><span class="o">----</span><span class="w"> </span><span class="o">------------</span><span class="w"> </span><span class="o">----------</span><span class="w"> </span><span class="o">---------------</span><span class="w"> </span><span class="o">-------------</span>
<span class="w"> </span><span class="n">datadog</span><span class="w"> </span><span class="n">my</span><span class="o">-</span><span class="n">pod</span><span class="w"> </span><span class="mi">500</span><span class="n">m</span><span class="w"> </span><span class="p">(</span><span class="mi">13</span><span class="err">%</span><span class="p">)</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="p">(</span><span class="mi">26</span><span class="err">%</span><span class="p">)</span><span class="w"> </span><span class="mi">512</span><span class="n">Mi</span><span class="w"> </span><span class="p">(</span><span class="mi">2</span><span class="err">%</span><span class="p">)</span><span class="w"> </span><span class="mi">1</span><span class="n">Gi</span><span class="w"> </span><span class="p">(</span><span class="mi">4</span><span class="err">%</span><span class="p">)</span>
<span class="p">...</span>
</code></pre></div>
<p>Whom should we believe? When different parts of the control plane disagree on the resource allotment, there's really only one place to get the truth from: the container cgroup itself.</p>
<p>To do so, we need to exec into the pod, and inspect the CPU quota and memory limit values.</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>kubectl<span class="w"> </span><span class="nb">exec</span><span class="w"> </span>-it<span class="w"> </span>my-pod<span class="w"> </span>--<span class="w"> </span>bash
user@my-pod:/$<span class="w"> </span>cat<span class="w"> </span>/sys/fs/cgroup/cpu/cpu.cfs_quota_us
-1
</code></pre></div>
<p>As detailed in the <a href="https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt">Linux kernel documentation</a> and on the Red Hat <a href="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu">documentation portal</a>:</p>
<blockquote>
<p>A value of -1 for <code>cpu.cfs_quota_us</code> indicates that the group does not have any
bandwidth restriction in place, such a group is described as an unconstrained
bandwidth group. This represents the traditional work-conserving behavior for
CFS.</p>
</blockquote>
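<p>Note that these paths assume cgroup v1, which was the norm when this was written. On a host using the unified cgroup v2 hierarchy, the equivalent files are <code>cpu.max</code> and <code>memory.max</code>, where the literal value <code>max</code> means unbounded. A quick way to tell which layout you are looking at:</p>

```shell
# cgroup v2 exposes a cgroup.controllers file at the root of the
# unified hierarchy; cgroup v1 mounts one directory per controller.
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
    echo "cgroup v2: check cpu.max and memory.max ('max' means unbounded)"
else
    echo "cgroup v1: check cpu/cpu.cfs_quota_us and memory/memory.limit_in_bytes"
fi
```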
<p>Now, the memory.</p>
<div class="highlight"><pre><span></span><code><span class="k">user</span><span class="nv">@my</span><span class="o">-</span><span class="nl">pod</span><span class="p">:</span><span class="o">/</span><span class="err">$</span><span class="w"> </span><span class="n">cat</span><span class="w"> </span><span class="o">/</span><span class="n">sys</span><span class="o">/</span><span class="n">fs</span><span class="o">/</span><span class="n">cgroup</span><span class="o">/</span><span class="n">memory</span><span class="o">/</span><span class="n">memory</span><span class="p">.</span><span class="n">limit_in_bytes</span>
<span class="mi">9223372036854771712</span>
</code></pre></div>
<p>That looks odd. This would indicate that the process has a limit of... roughly 8 EiB of memory!</p>
<p>Digging <a href="https://unix.stackexchange.com/questions/420906/what-is-the-value-for-the-cgroups-limit-in-bytes-if-the-memory-is-not-restricte">a bit further</a>, we learn that <code>9223372036854771712</code> is a kind of "magic" number in the memory management layer of the kernel, meaning that the process gets unbounded memory.</p>
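<p>The number is less magic than it looks: it is <code>LONG_MAX</code> (2<sup>63</sup>-1) rounded down to the nearest multiple of the 4 KiB page size, i.e. the largest page-aligned value the kernel can store, which effectively means "no limit". A quick sanity check, assuming a 4096-byte page size:</p>

```shell
long_max=9223372036854775807  # 2^63 - 1 on a 64-bit kernel
page_size=4096
# Round down to a page boundary, as the kernel does for memory.limit_in_bytes
echo $(( long_max / page_size * page_size ))   # prints 9223372036854771712
```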
<h2 id="conclusion">Conclusion</h2>
<p>Looking at the cgroup itself showed that a value of <code>0</code> for cpu/memory requests/limits is not intercepted by the <code>LimitRange</code> in place, and is translated to an unbounded cgroup in the end. It also showed that the pod resource requests and limits reported at the node level are inaccurate.</p>On meritocracy, identity and context2018-09-21T00:00:00+02:002018-09-21T00:00:00+02:00Balthazar Rouberoltag:blog.balthazar-rouberol.com,2018-09-21:/on-meritocracy-identity-and-context<p><strong>Before reading</strong></p>
<p>This is a deeply personal article, that hasn't been easy to write, especially with all the tension currently occurring in the tech industry, around inclusiveness, gender, code of conducts, etc. I've done my best to explain my thoughts on the matter, while being as respectful as possible. If …</p><p><strong>Before reading</strong></p>
<p>This is a deeply personal article, that hasn't been easy to write, especially with all the tension currently occurring in the tech industry, around inclusiveness, gender, code of conducts, etc. I've done my best to explain my thoughts on the matter, while being as respectful as possible. If you feel that you disagree with me, I'd be happy to debate with you, as long as the discussion stays civil and respectful.</p>
<hr>
<h3 id="the-lay-of-the-land">The lay of the land</h3>
<p>I have been working as a professional engineer in the tech industry for the last 7 years or so. My first contact with the subject of under-representation of minorities in the industry came during EuroPython 2012, when a tasteless tweet was posted the very night after <a href="http://www.roguelynn.com/">Lynn Root</a> talked about <a href="https://www.youtube.com/watch?v=l2PnVKQJg0I">"Increasing women engagement in the Python community"</a> (these events are best summarized by <a href="http://www.roguelynn.com/words/a-memorable-europython-for-the-better/">Lynn herself</a>). And these events kept <a href="https://en.wiktionary.org/wiki/Donglegate">happening</a>, and <a href="https://www.dailydot.com/debug/sexist-tech-conference-slide/">happening</a>, and <a href="https://en.wikipedia.org/wiki/Sexism_in_the_technology_industry#Incidents">happening</a>. My personal view has since then been that each of these highly publicized events was caused by an appalling lack of tact, thoughtfulness, empathy and respect, and that conference attendees should make sure to behave appropriately or suffer the consequences.</p>
<p><a href="http://confcodeofconduct.com/">Codes of Conduct</a> started to be <a href="https://ep2018.europython.eu/en/coc/">defined</a> for <a href="https://www.dotconferences.com/codeofconduct">conferences</a> and <a href="https://www.djangoproject.com/conduct/">online projects</a>, thanks to the initiative of groups and individuals pushing for more respectful and inclusive communities. I have always thought that these were essential and useful, because they seemed to make some conference attendees or project members feel safer (without making me feel any less so), and would probably help keep jerk-like behavior at bay.</p>
<p><a href="https://djangogirls.org/pyconuk/">Safe spaces</a> were organized at conferences (I even helped at a few myself, as a tutor), and I thought it was a wonderful idea. They did not take anything away from the most represented types of conference attendees, and allowed less represented people to find their bearings in a safe environment.</p>
<p>However, I feel something changed for me when I read the proposal to replace the <code>master/slave</code> terminology by <code>leader/follower</code> in the <a href="https://github.com/django/django/pull/2692">Django framework</a>. The PR starts with the following stance:</p>
<blockquote>
<p>The docs and some tests contain references to a master/slave db configuration.
While this terminology has been used for a long time, those terms may carry racially charged meanings to users.</p>
</blockquote>
<p>My view at the time was <em>"I mean, it does not really change anything for me, and if it can help people feel better..."</em>. Looking back, I'm pretty sure I felt a bit of unease reading the PR, but I (subconsciously or not) shrugged it off.</p>
<p>That debate recently resurfaced when the same thing happened in both the <a href="https://github.com/antirez/redis/issues/5335">redis</a> and the <a href="https://bugs.python.org/issue34605">CPython</a> codebases.
Reading what antirez (the redis creator) <a href="http://antirez.com/news/122">had to say on the subject</a> was a real moment of clarity for me.</p>
<blockquote>
<p>Today it happened again. A developer, that we’ll call Mark to avoid exposing his real name, read the Redis 5.0 RC5 change log, and was disappointed to see that Redis still uses the “master” and “slave” terminology in order to identify different roles in Redis replication.</p>
<p>I said that I was sorry he was disappointed about that, but at the same time, I don’t believe that terminology out of context is offensive, so if I use master-slave in the context of databases, and I’m not referring in any way to slavery. I originally copied the terms from MySQL, and now they are the way we call things in Redis, and since I do not believe in this battle (I’ll tell you later why), to change the documentation, deprecate the API and add a new one, change the INFO fields, just to make a subset of people that care about those things more happy, do not make sense to me.</p>
<p>After it was clear that I was not interested in his argument, Mark accused me of being fascist.</p>
</blockquote>
<p>At this point, I realized the landscape had dramatically changed, and that the inclusiveness debate had morphed into a more politicized and (according to me) confused and sterile version of itself.</p>
<p>Case in point, someone suggested the <a href="https://www.python.org/dev/peps/pep-0020/">Zen of Python</a> should be <a href="https://mail.python.org/pipermail/python-ideas/2018-September/053365.html">modified</a> because the sentence <em>Beautiful is better than ugly</em> could be interpreted as a support for body-shaming behaviors. Words cannot express how wrong this feels to me. That suggestion shows both a profound lack of contextual thinking, and a will to advance a pro political correctness agenda.</p>
<p>People have been talking about <a href="https://www.amazon.com/Beautiful-Code-Leading-Programmers-Practice/dp/0596510047">Beautiful Code</a> and <a href="http://uglycode.com/">Ugly Code</a> for a <strong>long time</strong>. Long enough to write books about it. Long enough so that I could have late night discussions about it with my father (who's also a computer scientist). To me, suggesting that <em>Beautiful is better than ugly</em> encourages body shaming feels alien, because it's completely <strong>out of context</strong>. Words have certain meanings in certain contexts. That's how we get away with synonyms. In the context of the <a href="https://www.python.org/dev/peps/pep-0020/">Zen of Python</a>, the word <em>Beautiful</em> clearly characterizes code, not people. The <a href="https://en.wikipedia.org/wiki/Dwarf_star">Dwarf Star</a> term defines a certain type of star, with given astrophysical properties. Should the entire astrophysics community rename it just because some people feel it's an offensive way of referring to <a href="https://fr.wikipedia.org/wiki/Peter_Dinklage">Peter Dinklage</a>? Similar humorous (or not?) <a href="https://bugs.python.org/msg324816">counter-arguments</a> were offered during the CPython master/slave debate.</p>
<p>It seems all we read about now (especially after Linus Torvalds' <a href="https://lkml.org/lkml/2018/9/16/167">temporary stepdown</a>) is either written by <a href="https://medium.com/culture-null/how-sjws-infiltrated-the-open-source-community-21001e7059ef">strong meritocracy</a> <a href="https://lkml.org/lkml/2018/9/16/198">partisans</a>, <a href="https://www.reddit.com/r/linux/comments/9ghrrj/linuxs_new_coc_is_a_piece_of_shit/e64h04h/">conspiracy theorists</a> or by <a href="https://archive.is/dgilk">strong inclusiveness defenders</a> (I've decided not to use the term SJW, as I understand it's a <a href="https://en.wikipedia.org/wiki/Social_justice_warrior">mocking and pejorative</a> term).</p>
<p>It was even <a href="https://mail.python.org/pipermail/python-ideas/2018-September/053369.html">suggested</a> and <a href="https://mail.python.org/pipermail/python-ideas/2018-September/053375.html">debated</a> whether this suggestion was made by a troll. The fact that, troll or not, the discussion lingered for several days is a very serious issue to me. It shows how polarized the debate now is, and how easily a strong community can be derailed.</p>
<h3 id="about-inclusivity-diversity-and-context">About inclusivity, diversity and context</h3>
<p>The core of the debate is focused on inclusivity and diversity (see <a href="https://bugs.python.org/issue34605">this example</a>), which got me thinking. It's clear to me <em>why</em> we want to push for diversity:</p>
<ul>
<li>a body of similar minds will likely produce similar solutions to a problem, causing the final adopted solution to be <a href="https://www.dailymail.co.uk/sciencetech/article-4800234/Is-soap-dispenser-RACIST.html">narrower</a></li>
<li>a person could (subconsciously or not) avoid a given career path or community because she/he might not feel represented enough, and thus feel excluded or as though she/he does not belong</li>
</ul>
<p>I want to focus on the second point, because I'm of the opinion that this is where the heated debates stem from.</p>
<p>If you read the <a href="https://www.contributor-covenant.org/version/1/4/code-of-conduct">Code of Conduct Covenant</a>, which is the code of conduct most current conferences and communities use or base theirs on, you will see that the text starts with:</p>
<blockquote>
<p>In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.</p>
</blockquote>
<p>I naturally tend to agree with this. We should all strive for inclusiveness and diversity, and should make sure everyone is treated gently and is given a friendly, open hand, whoever they are.
However, if I were fostering malicious intent, I could point out that this list does not cover diets. I myself am a flexitarian (I've cut out all fish and meat from my daily diet, but will eat some without issue if there's no other option). I could somehow feel unrepresented or even excluded from a given community if its CoC does not state that my personal diet should be respected.</p>
<p>Although that example could seem frivolous or ridiculous, it points out something I find interesting. That whole paragraph attempts to list all the ways people could differ, to make sure everyone is explicitly included. I would personally have phrased it in a more open-ended way:</p>
<blockquote>
<p>In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of who they are and how they identify.</p>
</blockquote>
<p>as I think some issues stem from the fact that we have attempted to list what constitutes "diversity". If a tech conference decides to impose quotas on speakers, these quotas will focus on certain attributes (eg sex and skin color) while missing others (eg age, education), which might help some people better identify with the speakers, but might not help others. This <em>inventaire à la Prévert</em> certainly looks like inclusiveness, but I think it misses the point.</p>
<p>How we identify is both subjective and subject to context. I might identify as an SRE, an engineer or a Python developer in the context of work or a tech-related event, a social extrovert in the context of a party, a leftist heterosexual male in the context of my personal and private life, etc.</p>
<p>How we identify depends on context, and yet we seem intent on mixing personal identities and non-personal contexts, the same way accusing <em>Beautiful is better than ugly</em> of promoting body shaming mixes human and technological contexts. I recognize that some situations are trickier than others (eg conferences, workplaces), because they can mix personal and professional contexts, thus blurring the lines.</p>
<p>If diversity is defined as having multiple identities present, then diversity must be subjective and subject to context too. To follow up on that tech conference example, I feel diversity in the technological content should reside in the education background, level of experience and field of interest of the speakers, while diversity in the social events tied to the conference could have a totally different definition.</p>
<p>These criteria are my own pick, but I suggest you clearly and openly define which ones matter to you if you're ever in the position of selecting speakers or employees.</p>
<h3 id="closing-words">Closing words</h3>
<p>In my view, the tech industry as a whole has been guilty of resistance to change by kicking around the old meritocracy horse for too long. We need to talk about the lack of women, the rampant misogynist attitudes, the gender pay gap. We need to fix these issues by acknowledging them first, and debating them transparently, in a less polarized way. Not just as an industry, but as a society.</p>
<p>However, as I don't buy into the "show me the code or GTFO" attitude, I don't believe in political correctness above everything else. If some people lack the ability to recognize that <em>Beautiful is better than ugly</em> in the <a href="https://www.python.org/dev/peps/pep-0020/">Zen of Python</a> does not body shame people, then maybe we shouldn't let them define what our core values are.</p>Solution to Advent of Code "Day 3: Spiral Memory"2017-12-31T00:00:00+01:002017-12-31T00:00:00+01:00Balthazar Rouberoltag:blog.balthazar-rouberol.com,2017-12-31:/solution-to-advent-of-code-day-3-spiral-memory<p>After an unsuccessful attempt at learning Rust earlier this year (I mainly read through the documentation without applying it in any project), I recently started to tackle the <a href="https://adventofcode.com/2017/">2017 edition of Advent of Code</a>, in order to practice Rust for real.</p>
<p>The 3rd challenge, <a href="https://adventofcode.com/2017/day/3"><em>Spiral Memory</em></a> is interesting because you …</p><p>After an unsuccessful attempt at learning Rust earlier this year (I mainly read through the documentation without applying it in any project), I recently started to tackle the <a href="https://adventofcode.com/2017/">2017 edition of Advent of Code</a>, in order to practice Rust for real.</p>
<p>The 3rd challenge, <a href="https://adventofcode.com/2017/day/3"><em>Spiral Memory</em></a>, is interesting because you can <a href="https://gist.github.com/pawlos/0cefa9d753bd6416e6cc9a456ed787f7">bruteforce</a> it, or solve it with math. I ended up doing the latter, even though math is really not my strong suit.</p>
<p>We're asked to calculate the <a href="https://en.wikipedia.org/wiki/Taxicab_geometry">Manhattan distance</a> between a given point and the center of a spiral grid. The problem amounts to finding the coordinates of any point $P$ in this spiral grid, as once we have the point coordinates, calculating the Manhattan distance is easy:</p>
<p>\begin{align*}
D_P &= |X_P - X_0| + |Y_P - Y_0| \\
&= |X_P| + |Y_P|
\end{align*}</p>
<h2 id="nested-shells">Nested shells</h2>
<p>My approach was the following: a spiral has nested "shells", all centered around the center. In this image, the first shell is outlined in grey, and the second one in purple. Each of these shells has a first value, called $S_i$, where $i$ is the index of the shell.</p>
<p><img alt="spiral" decoding="async" loading="lazy" src="images/memory-spiral.jpg"></p>
<p>For any point $(X_P, Y_P)$ of value $V$, we know that it is located on the shell right before the first shell whose start value $S$ satisfies $S > V$. For example, if the input value was 23, we know that it's located on the second shell, as $S_2 ≤ 23 < S_3$.</p>
<p><img alt="spiral" decoding="async" loading="lazy" src="images/spiral-shells.jpg"></p>
<p>We need to know the number of elements a shell of index $i$ is composed of, noted $Δ_i$. In this representation, the first shell is a square of side length 3, and the second shell a square of side length 5. We can generalize the side length to $L = 2i + 1$, where $i$ is the index of the shell. For any index $i$, the shell is composed of the following number of elements:</p>
<p>\begin{align*}
Δ_i &= (2i + 1)^2 - (2(i - 1) + 1)^2 \\
&= 4i^2 + 4i + 1 - 4i^2 + 4i - 1 \\
&= 8i
\end{align*}</p>
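<p>This derivation is easy to sanity-check numerically: the number of cells in shell $i$, computed as the difference between the areas of its bounding square and the previous one, should always equal $8i$. A quick Rust sketch (the function name is mine, written for illustration, not part of the final solution):</p>

```rust
// Sanity check for the shell-size derivation: the number of cells in
// shell i is the area of its bounding square, of side 2i + 1, minus
// the area of the previous shell's bounding square, of side 2(i - 1) + 1.
fn shell_size_by_squares(i: i32) -> i32 {
    let outer = 2 * i + 1;
    let inner = 2 * (i - 1) + 1;
    outer * outer - inner * inner
}

fn main() {
    // The difference of squares should always collapse to 8i.
    for i in 1..100 {
        assert_eq!(shell_size_by_squares(i), 8 * i);
    }
    println!("Δ_i = 8i holds for shells 1 to 99");
}
```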
<h2 id="coordinates-of-the-first-element-of-a-shell">Coordinates of the first element of a shell</h2>
<p>Once we know on which shell a given point $P$ is located, we need to know the coordinates of the first point $S_i$ of this shell, so we can infer $P$'s coordinates. This first point will always be located after the center point, and all points composing the previous shells. We can thus infer</p>
<p>\begin{equation*}
V_{S_i} = 2 + \sum_{x=1}^{i-1} Δ_x
\end{equation*}</p>
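<p>Since the $Δ_x$ terms form an arithmetic series, this sum also collapses into the closed form $V_{S_i} = 2 + 4i(i - 1)$, which avoids looping over the inner shells entirely. A small sketch comparing both (the closed form is my own simplification, and the function names are mine):</p>

```rust
// Start value of shell i, computed once as the literal sum from the
// formula above, and once with the closed form 2 + 4i(i - 1) obtained
// by collapsing the arithmetic series.
fn start_value_by_sum(i: i32) -> i32 {
    2 + (1..i).map(|x| 8 * x).sum::<i32>()
}

fn start_value_closed_form(i: i32) -> i32 {
    2 + 4 * i * (i - 1)
}

fn main() {
    assert_eq!(start_value_by_sum(1), 2);  // shell 1 starts at value 2
    assert_eq!(start_value_by_sum(2), 10); // shell 2 starts at value 10
    for i in 1..100 {
        assert_eq!(start_value_by_sum(i), start_value_closed_form(i));
    }
}
```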
<p>We now need to get the coordinates of any given first shell point. By simply looking at the spiral itself, we can deduce that</p>
<p>\begin{equation*}
(X_{S_i}, Y_{S_i}) = (i, -i + 1)
\end{equation*}</p>
<h2 id="navigating-the-spiral">Navigating the spiral</h2>
<p>The final piece of the puzzle is to infer the coordinates of the point $P$ given the coordinates of the start point $S_i$ of the shell it belongs to. To do that, we need to look at how the coordinates evolve along a shell.</p>
<p><img alt="spiral" decoding="async" loading="lazy" src="images/shell-coordinates.jpg"></p>
<p>We can see that:</p>
<ul>
<li>on the first quarter of the shell, $Y$ coordinates increase by 1 for each increasing value</li>
<li>on the second quarter of the shell, $X$ coordinates decrease by 1 for each increasing value</li>
<li>on the third quarter of the shell, $Y$ coordinates decrease by 1 for each increasing value</li>
<li>on the fourth quarter of the shell, $X$ coordinates increase by 1 for each increasing value</li>
</ul>
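<p>These four rules can be checked on a small example: walking shell 1 from its first point $(1, 0)$ should visit the eight cells holding values 2 through 9, in order. A quick Rust sketch of the walk (the <code>walk_shell</code> helper is mine, written for illustration, not part of the final solution):</p>

```rust
// Walk one full shell counter-clockwise using the four quarter rules,
// starting from the shell's first point (level, -level + 1).
fn walk_shell(level: i32) -> Vec<(i32, i32)> {
    let (mut x, mut y) = (level, -level + 1);
    let mut points = vec![(x, y)];
    for delta in 1..8 * level {
        // The first corner (level, level) is reached at delta = 2 * level - 1;
        // each following side of the shell spans 2 * level extra steps.
        let corner = 2 * level - 1;
        if delta <= corner {
            y += 1; // first quarter: move up
        } else if delta <= corner + 2 * level {
            x -= 1; // second quarter: move left
        } else if delta <= corner + 4 * level {
            y -= 1; // third quarter: move down
        } else {
            x += 1; // fourth quarter: move right
        }
        points.push((x, y));
    }
    points
}

fn main() {
    // Shell 1 holds values 2..=9 of the spiral.
    assert_eq!(
        walk_shell(1),
        vec![(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]
    );
    println!("{:?}", walk_shell(1));
}
```

Note that the four sides are not exactly equal in length: the first one is one step shorter, because the shell's first point sits one cell above the bottom-right corner.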
<p>To calculate the coordinates of the point $P$, we just need to locate it on the shell, start from $(X_{S_i}, Y_{S_i})$ and increase/decrease the $X$ and $Y$ coordinates until we reach the target value.</p>
<h2 id="the-implementation">The implementation</h2>
<p>The strategy is:</p>
<ul>
<li>calculate the start values of successive shells until we find a value greater than our target value</li>
<li>backtrack to the previous shell</li>
<li>compute the coordinates of the first point of the shell we backtracked to</li>
<li>increase/decrease the $X$ and $Y$ coordinates until we reach the target value</li>
<li>calculate the Manhattan distance using these coordinates</li>
</ul>
<div class="highlight"><pre><span></span><code><span class="c1">// advent_day03.rs</span>
<span class="k">fn</span> <span class="nf">nb_elements_in_outer_level</span><span class="p">(</span><span class="n">level</span>: <span class="kt">i32</span><span class="p">)</span><span class="w"> </span>-> <span class="kt">i32</span><span class="p">{</span>
<span class="w"> </span><span class="mi">8</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">level</span>
<span class="p">}</span>
<span class="k">fn</span> <span class="nf">start_element</span><span class="p">(</span><span class="n">level</span>: <span class="kt">i32</span><span class="p">)</span><span class="w"> </span>-> <span class="kt">i32</span> <span class="p">{</span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">level</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="mi">1</span>
<span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="mi">1</span><span class="o">..</span><span class="n">level</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">out</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="n">nb_elements_in_outer_level</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="n">out</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">2</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
<span class="k">fn</span> <span class="nf">first_element_coordinates</span><span class="p">(</span><span class="n">level</span>: <span class="kt">i32</span><span class="p">)</span><span class="w"> </span>-> <span class="p">(</span><span class="kt">i32</span><span class="p">,</span><span class="w"> </span><span class="kt">i32</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="p">(</span><span class="n">level</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">level</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">fn</span> <span class="nf">number_coordinates</span><span class="p">(</span><span class="n">number</span>: <span class="kt">i32</span><span class="p">)</span><span class="w"> </span>-> <span class="p">(</span><span class="kt">i32</span><span class="p">,</span><span class="w"> </span><span class="kt">i32</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">level</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span>
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">start</span>: <span class="kt">i32</span><span class="p">;</span>
<span class="w"> </span><span class="c1">// Increase level until we found a starting value greater than</span>
<span class="w"> </span><span class="c1">// input value. When such a value is found, backtrack a step.</span>
<span class="w"> </span><span class="k">loop</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">start_element</span><span class="p">(</span><span class="n">level</span><span class="p">);</span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">start</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">number</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">level</span><span class="w"> </span><span class="o">-=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span>
<span class="w"> </span><span class="fm">println!</span><span class="p">(</span><span class="s">"{:?} is found on level {:?} of the spiral"</span><span class="p">,</span><span class="w"> </span><span class="n">number</span><span class="p">,</span><span class="w"> </span><span class="n">level</span><span class="p">);</span>
<span class="w"> </span><span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">start_element</span><span class="p">(</span><span class="n">level</span><span class="p">);</span>
<span class="w"> </span><span class="k">break</span>
<span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">level</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="c1">// At this point, we've found the starting point of the spiral</span>
<span class="w"> </span><span class="c1">// level we number belongs to.</span>
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">number</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">start</span><span class="p">;</span>
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="p">(</span><span class="k">mut</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">first_element_coordinates</span><span class="p">(</span><span class="n">level</span><span class="p">);</span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">delta</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">level</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">level</span><span class="p">;</span>
<span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="n">delta</span><span class="p">;</span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">delta</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="mi">4</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">level</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">-=</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">level</span><span class="p">;</span>
<span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">-=</span><span class="w"> </span><span class="n">delta</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">level</span><span class="p">);</span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">delta</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="mi">6</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">level</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">-=</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">level</span><span class="p">;</span>
<span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">-=</span><span class="w"> </span><span class="n">delta</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="mi">4</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">level</span><span class="p">);</span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="n">delta</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="mi">6</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">level</span><span class="p">);</span>
<span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">fn</span> <span class="nf">manhattan_distance</span><span class="p">(</span><span class="n">x</span>: <span class="kt">i32</span><span class="p">,</span><span class="w"> </span><span class="n">y</span>: <span class="kt">i32</span><span class="p">)</span><span class="w"> </span>-> <span class="kt">i32</span><span class="p">{</span>
<span class="w"> </span><span class="n">x</span><span class="p">.</span><span class="n">abs</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">y</span><span class="p">.</span><span class="n">abs</span><span class="p">()</span>
<span class="p">}</span>
<span class="k">fn</span> <span class="nf">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">number</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">312051</span><span class="p">;</span>
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">number_coordinates</span><span class="p">(</span><span class="n">number</span><span class="p">);</span>
<span class="w"> </span><span class="fm">println!</span><span class="p">(</span><span class="s">"{:?} has coordinates {:?}"</span><span class="p">,</span><span class="w"> </span><span class="n">number</span><span class="p">,</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">));</span>
<span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="n">distance</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">manhattan_distance</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">);</span>
<span class="w"> </span><span class="fm">println!</span><span class="p">(</span><span class="s">"{:?} is at a distance of {:?} from the center"</span><span class="p">,</span><span class="w"> </span><span class="n">number</span><span class="p">,</span><span class="w"> </span><span class="n">distance</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div>
<h2 id="the-solution">The solution</h2>
<div class="highlight"><pre><span></span><code><span class="mi">312051</span> <span class="k">is</span> <span class="n">found</span> <span class="n">on</span> <span class="nb">level</span> <span class="mi">279</span> <span class="nb">of</span> <span class="n">the</span> <span class="n">spiral</span>
<span class="mi">312051</span> <span class="k">has</span> <span class="n">coordinates</span> (-<span class="mi">152</span>, -<span class="mi">278</span>)
<span class="mi">312051</span> <span class="k">is</span> <span class="nb">at</span> <span class="n">a</span> <span class="n">distance</span> <span class="nb">of</span> <span class="mi">430</span> <span class="nb">from</span> <span class="n">the</span> <span class="n">center</span>
</code></pre></div>On working from home while remaining sane2017-10-29T00:00:00+02:002017-10-29T00:00:00+02:00Balthazar Rouberoltag:blog.balthazar-rouberol.com,2017-10-29:/on-working-from-home-while-remaining-sane<p>Since I started working at <a href="https://datadoghq.com">Datadog</a>, I've had the opportunity of working from home full-time (for the second time in my career). Although I consider this to be a real privilege, it comes with its own set of challenges that I'd like to pinpoint and address in light of my …</p><p>Since I started working at <a href="https://datadoghq.com">Datadog</a>, I've had the opportunity of working from home full-time (for the second time in my career). Although I consider this to be a real privilege, it comes with its own set of challenges that I'd like to pinpoint and address in light of my personal experiences.</p>
<p>I hope this article will be useful for anyone willing to try out (or struggling with) remote work.</p>
<h2 id="productivity-vs-isolation">Productivity VS isolation</h2>
<p>First, why would you even want to work from home in the first place? To me, it's both about flexibility and productivity. I can focus on complex tasks for long periods of time without being <a href="http://heeris.id.au/2013/this-is-why-you-shouldnt-interrupt-a-programmer/">interrupted</a>, while still being able to keep a flexible timetable. I can also work from anywhere, as long as I can have a good enough internet connection.</p>
<p>However, this flexibility and freedom is paid for with isolation, which can then lead to demotivation or burn-out down the road. Remote work is, by definition, solitary, which can quickly become an issue, because humans are social animals and most of us crave social and physical interaction. This makes me believe that remote workers are more exposed to burn-out.</p>
<h2 id="the-burn-out-cycle">The burn-out cycle</h2>
<p>In my experience, the easiest path to demotivation or burn-out (whether you're working remotely or not) is being over-enthusiastic and working long hours. When doing so, it's easy to develop some kind of <em>hero complex</em>, a belief that you're indispensable and that things will break down if you take a break, or leave on holidays. The more hours you pull, the less sleep you get, the more stressed and tired you become. Because you're stressed, you then feel you need to work harder, until you just can't take it anymore, and you burn out.</p>
<p>Ideally, this cycle can be prevented or broken with proper management and supervision. If your manager realises you've started to walk this slippery slope, she/he should take action, and incite/force you to take a break. This can be enforced by regular 1-1 meetings, to keep track of how remote workers are doing.</p>
<p>This brings me to an important point: <strong>remote work can be dangerous if it's not in the company culture, and in that case you should keep away</strong>.</p>
<h2 id="remote-as-a-culture">Remote as a culture</h2>
<p>To enable sane remote work, a company must include remote workers in all events, when physically possible. All brown-bags, talks, all-hands, etc, should be streamed live, or at least recorded. If being out of the office means you have access to less information, it means that remote workers are seen as second-rate employees.</p>
<p>All communication must be asynchronous, to include remote workers, especially if teams are working across timezones. Whether it's Slack, email, Google Docs or something else, anyone should be able to catch up with any conversation or topic. Any significant direct discussion should be made available one way or another to remote workers.</p>
<p>Finally, it should be easy to go meet your team in person. I'd go even further and recommend you do it on a regular basis. I personally chose to go to our Paris office a week every month.</p>
<h2 id="work-hygiene">Work hygiene</h2>
<p>If your company has remote in its blood and culture, good! All that's left to figure out is <em>your</em> organisation and work hygiene. The following advice comes from my personal experience and should not be considered absolute truth backed by science. Take what makes sense to you.</p>
<h3 id="containerisation-of-private-and-personal-life">Containerisation of private and personal life</h3>
<p>The first thing I find absolutely essential is the containerisation (no, not Docker) of your private and professional life. You need a dedicated office room, with a door, that is not your living room. The idea is that when you open that door, you're at work, and when you close it, you're out. I find this to be especially important during the first weeks of remote work. I now find myself working more and more from my living room, but I know that if I need isolation for some reason, I still have this room I can go to.</p>
<p>For the same reason (along with a bazillion security reasons), never work from your personal machine. You want to make it a conscious effort to switch from watching Netflix to reading your work email.</p>
<h3 id="routine">Routine</h3>
<p>To me, routine is key to avoid getting tired. Try to wake up, start working, eat, stop working and go to sleep at regular hours. Ban any night work, especially when you're not on-call.</p>
<p>Exercise is also very important. It's easy to remain extremely sedentary during your remote workday, which can take a toll on your health. Also, one of the things I miss the most is my daily bike commute. I replaced it with 45 minutes of gym in the morning, 3 times a week. This has the nice advantage of making me feel like I accomplished something early in the day, and gives me the energy to keep going.</p>
<p>Also, take regular breaks, and go for a 15-minute walk at some point in the day.</p>
<h3 id="share">Share</h3>
<p>Talk with other remote workers about how <em>they</em> make it work. Share tips, stories, dos and don'ts, to build collective wisdom.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Working from home can be liberating and an amazing productivity booster, but you need to stay alert and conscious of the challenges and constraints it entails. I'd urge you to show a fair amount of self-discipline and organisation in order to avoid falling into the burn-out spiral.</p>
<p>Have fun!</p>
<h2 id="edit">Edit</h2>
<p>I found this very interesting resource from Trello, called <a href="https://info.trello.com/hubfs/Trello-Embrace-Remote-Work-Ultimate-Guide.pdf">How to embrace remote work</a>.</p>
<p>The main takeaways I get from it are:</p>
<ul>
<li>pace yourself: work isn't going anywhere. Do not forget to take breaks.</li>
<li>use the right tool to convey the right information (do not rely on instant messaging for crucial information!)</li>
<li>don't forget to use passive communication (e.g. status messages)</li>
<li>intent can be lost over text communication. Assume <strong>positive</strong> intent.</li>
</ul>The story of the 20°C cronjob2017-05-25T00:00:00+02:002017-05-25T00:00:00+02:00Balthazar Rouberoltag:blog.balthazar-rouberol.com,2017-05-25:/the-story-of-the-20degc-cronjob<p>For the last month or so, the lifespan of my beloved Thinkpad X1 Carbon battery had been going down the drain, from 5-6 hours to less than 3. Following <a href="https://twitter.com/padenot">@padenot</a>'s advice, I installed <code>powertop</code> and started investigating what was draining this good ol' battery of mine.</p>
<p>Looking at the <code>powertop</code> output, I immediately realized that something fishy was happening on this laptop:</p>
<div class="highlight"><pre><span></span><code>The battery reports a discharge rate of 4.95 W
The estimated remaining time is 2 hours, 6 minutes
Summary: 1111.7 wakeups/second, 7.9 GPU ops/seconds, 0.0 VFS ops/sec and 23.0% CPU use
Usage Events/s Category Description
264.4 ms/s 3656.7 Process /bin/bash /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t -f br
114.3 ms/s 626.2 Process /usr/lib64/firefox/firefox
20.7 ms/s 95.5 Process /opt/sublime_text_3/plugin_host 3272
...
</code></pre></div>
<p>Why was <code>sendmail</code> so busy, and why in the hell was it running anyway? <code>strace</code> showed me that the process was indeed very busy, and <code>mailq</code> showed that I had more than 15000 outgoing emails in the system mail queue!</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>mailq
...
mail<span class="w"> </span><span class="k">in</span><span class="w"> </span>dir<span class="w"> </span>/home/br/.esmtp_queue/TSRueRJD:
<span class="w"> </span>From:<span class="w"> </span><span class="s2">"(Cron Daemon)"</span><span class="w"> </span><br><span class="w"> </span>To:<span class="w"> </span>br
mail<span class="w"> </span><span class="k">in</span><span class="w"> </span>dir<span class="w"> </span>/home/br/.esmtp_queue/ZI1LtzhT:
<span class="w"> </span>From:<span class="w"> </span><span class="s2">"(Cron Daemon)"</span><span class="w"> </span><br><span class="w"> </span>To:<span class="w"> </span>br
<span class="m">15653</span><span class="w"> </span>mails<span class="w"> </span>to<span class="w"> </span>deliver
</code></pre></div>
<p>Ok, so all these mails were being sent by <code>cron</code>. My user crontab only had one job, and it was <code>* * * * * rm $HOME/crash_dump.erl</code>. Indeed, I had been experimenting with <a href="http://elixir-lang.org/">Elixir</a> recently, and when I crashed the Erlang VM, this file would pop-up in my home directory. At some point, I added this cronjob to make it go away and forgot about it. As the job's output (<code>rm</code> prints its error to <code>stderr</code>) was not redirected to <code>/dev/null</code>, each time the file was not found, the cron job would fail and a mail would be added to the queue.</p>
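<p>For illustration, a quiet version of such a cronjob could look like this (a sketch, not my exact crontab; <code>-f</code> makes <code>rm</code> succeed even when the file is missing, and the redirection discards any remaining output):</p>

```shell
# Silence cron mail for every job in this crontab
MAILTO=""

# -f: exit 0 even if the file does not exist; discard stdout and stderr
* * * * * rm -f "$HOME/crash_dump.erl" >/dev/null 2>&1
```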
<p>After removing this job, purging the mail queue, and adding <code>MAILTO=""</code> at <a href="https://www.cyberciti.biz/faq/disable-the-mail-alert-by-crontab-command/">the beginning of my crontab</a> (to avoid repeating this investigation down the road), <code>sendmail</code> went quiet, my battery life went back to ~6 hours, and the laptop average temperature went down 20°C.</p>Preparing the SRE interview2017-04-20T00:00:00+02:002017-04-20T00:00:00+02:00Balthazar Rouberoltag:blog.balthazar-rouberol.com,2017-04-20:/preparing-the-sre-interview<p>I recently interviewed for an <abbr title="Site Reliability Engineer">SRE</abbr> position. I spent a full week learning (or refreshing my memory) on the subjects and topics that could be covered in such an interview. I'll try and lay down the list of topics I covered and resources I used.</p>
<h2 id="what-is-an-sre">What is an SRE?</h2>
<p>Having spent the last 2 years employed as a DevOps engineer, I've often felt that DevOps and SRE were two slightly differing implementations of the same ideas. The first feels like a set of general principles, while the second is a clear and detailed model (pre-dating DevOps), with a set of rules and guidelines. Google developed the SRE model and explained it in the <a href="https://landing.google.com/sre/book.html">SRE book</a>. The underlying ideas are simple, but powerful:</p>
<ul>
<li>Develop tools and systems reducing toil and repetitive work from engineers</li>
<li>Automate everything, or as much as possible (deployments, maintenances, tests, scaling, mitigation)</li>
<li>Monitor everything</li>
<li>Think scalable from the start</li>
<li>Build <strong>resilient-enough</strong> architectures</li>
<li>Handle change and risk through <abbr title="Service Level Agreement">SLA</abbr>s, <abbr title="Service Level Objective">SLO</abbr>s and <abbr title="Service Level Indicator">SLI</abbr>s</li>
<li>Learn from outages</li>
</ul>
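<p>To make the SLO bullet a bit more concrete, here is a quick, hypothetical error-budget calculation (the availability target is illustrative, not taken from the SRE book):</p>

```python
def monthly_error_budget_minutes(slo: float, days: int = 30) -> float:
    """Return the allowed downtime per month, in minutes, for a given availability SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo)

# A 99.9% availability SLO leaves ~43.2 minutes of downtime per 30-day month
print(round(monthly_error_budget_minutes(0.999), 1))
```

<p>Spending that budget deliberately (deployments, experiments) rather than on unplanned outages is the core of the SLO-driven approach to risk.</p>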
<p>If you haven't yet read the <a href="https://landing.google.com/sre/book.html">SRE book</a>, I strongly urge you to do so. There's even a <a href="https://landing.google.com/sre/book/index.html">free online version</a> available. If you do not have the time, then maybe have a look at this Ben Treynor (Google VP Engineering) <a href="https://landing.google.com/sre/interview/ben-treynor.html">What is 'Site Reliability Engineering'?</a> interview, for a general introduction.</p>
<p>According to the SRE book, an SRE should spend half of their time on "ops" work, and the other half doing development.</p>
<blockquote>
<p>Google places a 50% cap on the aggregate "ops" work for all SREs—tickets, on-call, manual tasks, etc. [...] An SRE team must spend the remaining 50% of its time actually doing development.
<a href="https://landing.google.com/sre/book/chapters/introduction.html">Source</a></p>
</blockquote>
<p>Some skills are thus paramount to an SRE:</p>
<ul>
<li>coding / software development</li>
<li>system administration and automation</li>
<li>scalable system design</li>
<li>system troubleshooting</li>
</ul>
<p>Consequently, each of these areas of expertise can be (and often are) the subject of an interview.</p>
<h2 id="coding-software-development-interview">Coding / Software development interview</h2>
<p>I've found that the reference resource to prepare a coding interview, especially when targeting companies like Amazon, Google, Microsoft, Yahoo, etc, is <a href="https://www.amazon.com/Cracking-Coding-Interview-Programming-Questions/dp/0984782850/ref=sr_1_1?ie=UTF8&qid=1492689425&sr=8-1&keywords=cracking+the+coding+interview">Cracking the Coding Interview</a>, by <a href="https://www.amazon.com/Gayle-Laakmann-McDowell/e/B004BI1ZUQ/ref=dp_byline_cont_book_1">Gayle Laakmann McDowell</a>. This book is a real trove of advice (technical or not) and example exercises (with the associated solutions).</p>
<p>Even though it is targeted to <em>software developer</em> interviews, I still covered the following topics listed in the <em>Must Know</em> section of the book:</p>
<p><strong>Data structures</strong>:</p>
<ul>
<li>Linked list</li>
<li>Stack</li>
<li>Queue</li>
<li>Heap</li>
<li>Hash table</li>
<li>Binary tree</li>
<li>associated Big-O <a href="http://bigocheatsheet.com/">time and memory complexity</a> for common operations (Search, insert, delete, etc).</li>
</ul>
<p>I found <a href="https://www.amazon.com/Data-Structures-Algorithms-Using-Python/dp/1590282337">Data structures and Algorithms using Python and C++</a> to be useful (albeit a bit lengthy) when dealing with these data structures for the first time. This <a href="http://www.columbia.edu/~jxz2101/#">presentation</a> gives a short but to-the-point, no-nonsense introduction to these data structures.</p>
<p><strong>Algorithms</strong></p>
<ul>
<li>Mergesort</li>
<li>Quicksort</li>
<li>Binary search</li>
</ul>
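<p>As a warm-up, here is a sketch of one of those must-know algorithms, iterative binary search, in Python (my own example, not taken from the book):</p>

```python
def binary_search(items, target):
    """Return the index of target in the sorted list items, or -1 if absent.

    Runs in O(log n) by halving the search interval at each step.
    """
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        elif items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

print(binary_search([1, 3, 5, 7, 9], 7))  # → 3
print(binary_search([1, 3, 5, 7, 9], 4))  # → -1
```

<p>Being able to write this off the top of your head, and to state its complexity, is table stakes for the coding round.</p>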
<p>I also had a look at <a href="https://github.com/adicu/interview_help/">https://github.com/adicu/interview_help</a> to practice on some real-life interview questions, and at <a href="https://github.com/nryoung/algorithms">https://github.com/nryoung/algorithms</a> to read Python implementations of common data structures and algorithms.</p>
<h2 id="scalable-system-design-interview">Scalable system design interview</h2>
<p>This was my favorite subject to work on, as an apparently simple question such as "Design the bit.ly service" hides unexpected depths of complexity. Being able to design a scalable system implies knowing about:</p>
<ul>
<li>DNS</li>
<li>load balancing</li>
<li>micro-service architecture</li>
<li>CAP theorem</li>
<li>consistency patterns</li>
<li>availability patterns</li>
<li>databases</li>
<li>caching</li>
<li>asynchronism patterns</li>
<li>etc</li>
</ul>
<p>The main idea is to be able to identify the architecture's bottlenecks, and to size it with an appropriate number of machines using some "back-of-the-envelope" calculations, while keeping it robust and failure tolerant.</p>
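<p>Such back-of-the-envelope calculations are easy to script. Here is a hypothetical sizing for a bit.ly-like service (every input number below is made up for illustration):</p>

```python
# Hypothetical inputs: 100M new short links per month, 10:1 read/write ratio
new_links_per_month = 100_000_000
seconds_per_month = 30 * 24 * 3600  # ~2.6M seconds

write_qps = new_links_per_month / seconds_per_month
read_qps = write_qps * 10
# Each stored link: short code + URL + metadata, say ~500 bytes
storage_per_year_gb = new_links_per_month * 12 * 500 / 1e9

print(f"~{write_qps:.0f} writes/s, ~{read_qps:.0f} reads/s, ~{storage_per_year_gb:.0f} GB/year")
```

<p>Even rough numbers like these tell you whether the design needs sharding, caching, or can live on a single database.</p>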
<p>The most useful resources I found to prepare were:</p>
<ul>
<li><a href="https://www.youtube.com/watch?v=-W9F__D3oY4">Scalability lecture</a> given at Harvard</li>
<li><a href="http://norvig.com/21-days.html#answers">Latency Numbers Every Programmer Should Know</a></li>
<li><a href="https://github.com/donnemartin/system-design-primer">The System Design Primer</a> (I suggest you follow the links after each section for an in-depth follow-up)</li>
<li>this great <a href="https://www.hiredintech.com/classrooms/system-design/lesson/52">step-by-step walkthrough</a> on design questions, by HiredInTech</li>
<li><a href="https://www.youtube.com/watch?v=vg5onp8TU6Q">Scaling up to your first 10 million users</a>, talk given by Joel Williams of AWS</li>
<li><a href="http://www.puncsky.com/blog/2016/02/14/crack-the-system-design-interview/">Crack the design interview</a></li>
<li><a href="https://docs.microsoft.com/en-us/azure/architecture/guide/technology-choices/data-store-overview">When to use NoSQL vs SQL</a></li>
</ul>
<h2 id="system-troubleshooting-interview">System troubleshooting interview</h2>
<p>To be able to automate the administration of a system, one should first know said system in depth, which, in a lot of cases, will be GNU/Linux. If you have time, I strongly suggest reading <a href="https://www.amazon.com/Linux-Programming-Interface-System-Handbook/dp/1593272200/ref=sr_1_1?ie=UTF8&qid=1492692882&sr=8-1&keywords=linux+programming+interface">The Linux Programming Interface</a>. Note that this is a <strong>large</strong> book (my version has 1556 pages) focusing on an old version of the Linux kernel (2.6.x). Fear not! You'll still gain a vast knowledge about how a GNU/Linux system operates. For a quicker tour, you could have a look at the <a href="http://learnlinuxconcepts.blogspot.fr/2014/10/this-blog-is-to-help-those-students-and.html">Linux Kernel Internals</a> blog. You'll also find interesting SRE interview questions and answers in this <a href="https://syedali.net/engineer-interview-questions/">SRE interview questions</a> blogpost.</p>
<p><a href="https://jvns.ca/">Julia Evans</a>, also known as <a href="https://twitter.com/b0rk">b0rk</a> has written some absolutely <strong>fantastic</strong> beginner-friendly resources about troubleshooting and networking.
I strongly recommend having a look at:</p>
<ul>
<li><a href="http://jvns.ca/debugging-zine.pdf">the debugging zine</a></li>
<li><a href="https://jvns.ca/networking-zine.pdf">networking! ACK!</a></li>
<li><a href="http://jvns.ca/strace-zine-v2.pdf">How to spy on your programs with <code>strace</code></a></li>
</ul>
<p>Mastering the mentioned tools (<code>strace</code>, <code>tcpdump</code>, <code>netstat</code>, <code>lsof</code>, <code>ngrep</code>, etc) gave me some good debugging chops I have applied in production many times.</p>
<p>Netflix has also written a very nice and thorough blogpost on performance troubleshooting: <a href="http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html">Linux Performance Analysis in 60,000 Milliseconds</a>, detailing what to check in case of a performance issue.</p>
<h2 id="wait-theres-more">Wait, there's more</h2>
<p>Technical knowledge is one thing, but SRE being a relatively new discipline, I also wanted to get real-life feedback from practicing SREs. To that end, I watched the following (great) talks:</p>
<ul>
<li><a href="https://www.usenix.org/conference/srecon15/program/presentation/limoncelli">Case Study: Adopting SRE Principles at StackOverflow</a>, by Tom Limoncelli of Stack Exchange</li>
<li><a href="https://www.youtube.com/watch?v=fsTpRx8Pt-k">Love DevOps? Wait until you meet SRE</a>, by Nick Wright, from Atlassian</li>
<li><a href="https://www.usenix.org/conference/srecon17americas/program/presentation/training-new-sres">Panel: training new SREs</a>, with Katie Ballinger (CircleCI), Saravanan Loganathan (Yahoo), Rita Lu (Google), Craig Sebenik (Matterport), Andrew Widdowson (Google)</li>
</ul>
<h2 id="oh-and-one-last-thing">Oh and one last thing...</h2>
<blockquote class="twitter-tweet" data-lang="fr"><p lang="en" dir="ltr">I'm super excited to announce I'm joining <a href="https://twitter.com/datadoghq">@datadoghq</a> as an SRE ! <a href="https://t.co/Ji1JJQLJ4x">pic.twitter.com/Ji1JJQLJ4x</a></p>— Balthazar Rouberol (@brouberol) <a href="https://twitter.com/brouberol/status/854620051307196417">19 avril 2017</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>Celery best practices2015-12-29T00:00:00+01:002015-12-29T00:00:00+01:00Balthazar Rouberoltag:blog.balthazar-rouberol.com,2015-12-29:/celery-best-practices<p>I've been programming with <a href="http://celery.readthedocs.org/">celery</a> for the last three years, and <a href="https://denibertovic.com/pages/about-me/">Deni Bertović</a>'s article about <a href="https://denibertovic.com/posts/celery-best-practices/">Celery best practices</a> has truly been invaluable to me. In time, I've also come up with my set of best practices, and I guess this blog is as good a place as any to write them down.</p>
<h2 id="write-short-tasks">Write short tasks</h2>
<p>I think a task should be as concise as possible, so that one can quickly understand what it does and how it handles corner cases. I personally try to follow these rules:</p>
<ul>
<li>wrap the main task logic in an object method or a function</li>
<li>make this method/function raise identified exceptions for identified corner cases, and decide on the logic for each of them</li>
<li>implement a retry mechanism only where appropriate</li>
</ul>
<p>Let's illustrate these rules with a simple example: sending an email using a 3rd party API (e.g. <a href="https://mailgun.com">Mailgun</a>, <a href="https://en.mailjet.com/">Mailjet</a>, etc). Anyone having spent enough time using third-party infrastructure and systems knows they should never totally rely on them: the network can fail, they can be unavailable, etc. We thus need to handle some foreseeable error cases and have a fallback strategy in case of an unexpected error.</p>
<p>Let's say that we have a function <code>api_send_mail</code> that does the actual API call, raising a <code>myapp.exceptions.InvalidUserInput</code> exception, in case of an HTTP client error. This exception constitutes our set of expected exceptions that we need to plan for. Any other exception (connection error, server HTTP error, etc) will be sent to some crash report backend, like <a href="http://getsentry.com">Sentry</a> and trigger a retry.</p>
<p>My task implementation would look something like this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">myproject.tasks</span> <span class="kn">import</span> <span class="n">app</span> <span class="c1"># app is your celery application</span>
<span class="kn">from</span> <span class="nn">myproject.exceptions</span> <span class="kn">import</span> <span class="n">InvalidUserInput</span>
<span class="kn">from</span> <span class="nn">utils.mail</span> <span class="kn">import</span> <span class="n">api_send_mail</span>
<span class="nd">@app</span><span class="o">.</span><span class="n">task</span><span class="p">(</span><span class="n">bind</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">max_retries</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">send_mail</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">recipients</span><span class="p">,</span> <span class="n">sender_email</span><span class="p">,</span> <span class="n">subject</span><span class="p">,</span> <span class="n">body</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""Send a plaintext email with argument subject, sender and body to a list of recipients."""</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">api_send_mail</span><span class="p">(</span><span class="n">recipients</span><span class="p">,</span> <span class="n">sender_email</span><span class="p">,</span> <span class="n">subject</span><span class="p">,</span> <span class="n">body</span><span class="p">)</span>
<span class="k">except</span> <span class="n">InvalidUserInput</span><span class="p">:</span>
<span class="c1"># No need to retry as the user provided an invalid input</span>
<span class="k">raise</span>
<span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">exc</span><span class="p">:</span>
<span class="c1"># Any other exception. Log the exception to sentry and retry in 10s.</span>
<span class="n">sentrycli</span><span class="o">.</span><span class="n">captureException</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">retry</span><span class="p">(</span><span class="n">countdown</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">exc</span><span class="o">=</span><span class="n">exc</span><span class="p">)</span>
<span class="k">return</span> <span class="n">data</span>
</code></pre></div>
<p>What the task actually does is abstracted one layer down, and almost all the rest of the task body is handling errors. I feel that it's easier to grasp the bigger picture, and that the task is easier to maintain this way.</p>
<h2 id="retry-gracefully">Retry gracefully</h2>
<p>Setting fixed countdowns for retries may not be what you want. I tend to prefer using a backoff increasing with the number of retries. This means the more a task fails, the more we have to wait until the next retry. I think this has a couple of interesting consequences:</p>
<ul>
<li>we don't hammer the external service in case of an outage,</li>
<li>it gives more time to the service to go back to normal,</li>
<li>and thus increases our overall chance of success</li>
</ul>
<p>A simple (but nonetheless effective) implementation could look something like this:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">backoff</span><span class="p">(</span><span class="n">attempts</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""Return a backoff delay, in seconds, given a number of attempts.</span>
<span class="sd">    The delay increases very rapidly with the number of attempts:</span>
<span class="sd"> 1, 2, 4, 8, 16, 32, ...</span>
<span class="sd"> """</span>
<span class="k">return</span> <span class="mi">2</span> <span class="o">**</span> <span class="n">attempts</span>
<span class="nd">@app</span><span class="o">.</span><span class="n">task</span><span class="p">(</span><span class="n">bind</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">max_retries</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">send_mail</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">recipients</span><span class="p">,</span> <span class="n">sender_email</span><span class="p">,</span> <span class="n">subject</span><span class="p">,</span> <span class="n">body</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""Send a plaintext email with argument subject, sender and body to a list of recipients."""</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">api_send_mail</span><span class="p">(</span><span class="n">recipients</span><span class="p">,</span> <span class="n">sender_email</span><span class="p">,</span> <span class="n">subject</span><span class="p">,</span> <span class="n">body</span><span class="p">)</span>
<span class="k">except</span> <span class="n">InvalidUserInput</span><span class="p">:</span>
<span class="k">raise</span>
<span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">exc</span><span class="p">:</span>
<span class="n">sentrycli</span><span class="o">.</span><span class="n">captureException</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">retry</span><span class="p">(</span><span class="n">countdown</span><span class="o">=</span><span class="n">backoff</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">request</span><span class="o">.</span><span class="n">retries</span><span class="p">),</span> <span class="n">exc</span><span class="o">=</span><span class="n">exc</span><span class="p">)</span>
<span class="o">...</span>
</code></pre></div>
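<p>One refinement I'd consider on top of this (my own addition, not part of the snippet above): cap the delay and add random jitter, so that a burst of failing tasks doesn't retry in lockstep and hammer the recovering service all at once:</p>

```python
import random

def backoff_with_jitter(attempts, cap=300):
    """Exponential backoff capped at `cap` seconds, with full jitter.

    Returns a random delay in [0, min(cap, 2 ** attempts)].
    """
    return random.uniform(0, min(cap, 2 ** attempts))
```

<p>It plugs into the task the same way, via <code>self.retry(countdown=backoff_with_jitter(self.request.retries), exc=exc)</code>.</p>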
<h2 id="fail-fast-and-dont-block-forever">Fail fast and don't block forever</h2>
<p>One thing to remember is to <strong>always</strong> specify a timeout on I/O operations, or at least on the celery task itself. If you don't, it's possible all your tasks could block indefinitely, which would then prevent any additional task from starting. In the context of the <code>send_mail</code> task, I could probably do something like this, as an API call should probably not take more than 5 seconds:</p>
<div class="highlight"><pre><span></span><code><span class="nd">@app</span><span class="o">.</span><span class="n">task</span><span class="p">(</span>
<span class="n">bind</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">max_retries</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span>
<span class="n">soft_time_limit</span><span class="o">=</span><span class="mi">5</span> <span class="c1"># time limit is in seconds.</span>
<span class="p">)</span>
<span class="k">def</span> <span class="nf">send_mail</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">recipients</span><span class="p">,</span> <span class="n">sender_email</span><span class="p">,</span> <span class="n">subject</span><span class="p">,</span> <span class="n">body</span><span class="p">):</span>
<span class="o">...</span>
</code></pre></div>
<p>If the task takes more than 5 seconds to complete, the <code>celery.exceptions.SoftTimeLimitExceeded</code> exception would get raised and logged to Sentry.</p>
<p>I also tend to set the <a href="https://celery.readthedocs.org/en/latest/configuration.html?highlight=eager#celeryd-task-soft-time-limit"><code>CELERYD_TASK_SOFT_TIME_LIMIT</code></a> configuration option with a default value of 300 (5 minutes). This will act as a failsafe if I forget to set an appropriate <code>soft_time_limit</code> option on a task.</p>
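<p>For instance, a sketch of that failsafe on the application configuration (reusing the hypothetical <code>myproject.tasks</code> module from the earlier examples; the exact configuration mechanism may differ across Celery versions):</p>

```python
from myproject.tasks import app  # app is your celery application

# Fallback soft time limit (in seconds) for any task that doesn't set its own
app.conf.CELERYD_TASK_SOFT_TIME_LIMIT = 300
```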
<h2 id="share-common-behavior-among-tasks">Share common behavior among tasks</h2>
<p>All that is pretty dandy, but I don't want to re-implement the exception catching for every task. I should be able to specify a basic behavior shared between all my tasks. Turns out you can, using an <a href="https://celery.readthedocs.org/en/latest/userguide/tasks.html?highlight=context#abstract-classes">abstract task class</a>.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">myproject.tasks</span> <span class="kn">import</span> <span class="n">app</span>
<span class="k">class</span> <span class="nc">BaseTask</span><span class="p">(</span><span class="n">app</span><span class="o">.</span><span class="n">Task</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""Abstract base class for all tasks in my app."""</span>
<span class="n">abstract</span> <span class="o">=</span> <span class="kc">True</span>
<span class="k">def</span> <span class="nf">on_retry</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">exc</span><span class="p">,</span> <span class="n">task_id</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">kwargs</span><span class="p">,</span> <span class="n">einfo</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""Log the exceptions to sentry at retry."""</span>
<span class="n">sentrycli</span><span class="o">.</span><span class="n">captureException</span><span class="p">(</span><span class="n">exc</span><span class="p">)</span>
<span class="nb">super</span><span class="p">(</span><span class="n">BaseTask</span><span class="p">,</span> <span class="bp">self</span><span class="p">)</span><span class="o">.</span><span class="n">on_retry</span><span class="p">(</span><span class="n">exc</span><span class="p">,</span> <span class="n">task_id</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">kwargs</span><span class="p">,</span> <span class="n">einfo</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">on_failure</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">exc</span><span class="p">,</span> <span class="n">task_id</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">kwargs</span><span class="p">,</span> <span class="n">einfo</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""Log the exceptions to sentry."""</span>
<span class="n">sentrycli</span><span class="o">.</span><span class="n">captureException</span><span class="p">(</span><span class="n">exc</span><span class="p">)</span>
<span class="nb">super</span><span class="p">(</span><span class="n">BaseTask</span><span class="p">,</span> <span class="bp">self</span><span class="p">)</span><span class="o">.</span><span class="n">on_failure</span><span class="p">(</span><span class="n">exc</span><span class="p">,</span> <span class="n">task_id</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">kwargs</span><span class="p">,</span> <span class="n">einfo</span><span class="p">)</span>
<span class="nd">@app</span><span class="o">.</span><span class="n">task</span><span class="p">(</span>
<span class="n">bind</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">max_retries</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span>
<span class="n">soft_time_limit</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
<span class="n">base</span><span class="o">=</span><span class="n">BaseTask</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">send_mail</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">recipients</span><span class="p">,</span> <span class="n">sender_email</span><span class="p">,</span> <span class="n">subject</span><span class="p">,</span> <span class="n">body</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""Send a plaintext email with argument subject, sender and body to a list of recipients."""</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">api_send_mail</span><span class="p">(</span><span class="n">recipients</span><span class="p">,</span> <span class="n">sender_email</span><span class="p">,</span> <span class="n">subject</span><span class="p">,</span> <span class="n">body</span><span class="p">)</span>
<span class="k">except</span> <span class="n">InvalidUserInput</span><span class="p">:</span>
<span class="k">raise</span>
<span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">exc</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">retry</span><span class="p">(</span><span class="n">countdown</span><span class="o">=</span><span class="n">backoff</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">request</span><span class="o">.</span><span class="n">retries</span><span class="p">),</span> <span class="n">exc</span><span class="o">=</span><span class="n">exc</span><span class="p">)</span>
<span class="k">return</span> <span class="n">data</span>
</code></pre></div>
<p>You can see that the <code>send_mail</code> task implementation only deals with email sending and expected error handling. Everything else is handled by the abstract base class. If the common behavior is more complex, this trick can <em>drastically</em> reduce the size of each task body and the amount of duplicated code in your tasks.</p>
<p><strong>Note</strong>: this example is only here to demonstrate how to share behavior between tasks. To properly integrate Sentry with Celery, have a look at <a href="https://docs.getsentry.com/hosted/clients/python/integrations/celery/">this page</a>.</p>
<p><strong>Tip</strong>: have a look at the list of <a href="https://celery.readthedocs.org/en/latest/userguide/tasks.html?highlight=context#handlers">available handlers</a>, to get an idea of what behavior can be shared between tasks.</p>
<h2 id="write-large-tasks-as-classes">Write large tasks as classes</h2>
<p>So far, I've only implemented tasks as functions. However, it's also possible to define <a href="https://celery.readthedocs.org/en/latest/userguide/tasks.html#custom-task-classes">class tasks</a>.</p>
<p>I think one of the scenarios where class tasks really shine is when you'd like to split a large task function into several well-defined and testable methods. As you can see <a href="https://celery.readthedocs.org/en/latest/userguide/tasks.html#custom-task-classes">here</a>, the <code>celery.task</code> decorator will generate a task class and inject the decorated function as the class <code>run</code> method.
Defining a class task amounts to defining a class inheriting from <code>app.Task</code> with a <code>run</code> method.</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">handle_event</span><span class="p">(</span><span class="n">BaseTask</span><span class="p">):</span> <span class="c1"># BaseTask inherits from app.Task</span>
<span class="k">def</span> <span class="nf">validate_input</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">event</span><span class="p">):</span>
<span class="o">...</span>
<span class="k">def</span> <span class="nf">get_or_create_model</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">event</span><span class="p">):</span>
<span class="o">...</span>
<span class="k">def</span> <span class="nf">stream_event</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">event</span><span class="p">):</span>
<span class="o">...</span>
<span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">event</span><span class="p">):</span>
<span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="o">.</span><span class="n">validate_input</span><span class="p">(</span><span class="n">event</span><span class="p">):</span>
<span class="k">raise</span> <span class="n">InvalidInput</span><span class="p">(</span><span class="n">event</span><span class="p">)</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">model</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">get_or_create_model</span><span class="p">(</span><span class="n">event</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">call_hooks</span><span class="p">(</span><span class="n">event</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">persist_model</span><span class="p">(</span><span class="n">event</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">exc</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">retry</span><span class="p">(</span><span class="n">countdown</span><span class="o">=</span><span class="n">backoff</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">request</span><span class="o">.</span><span class="n">retries</span><span class="p">),</span> <span class="n">exc</span><span class="o">=</span><span class="n">exc</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">stream_event</span><span class="p">(</span><span class="n">event</span><span class="p">)</span>
</code></pre></div>
<p>By doing this, the task logic is clear and easy to follow (the <code>run</code> method stays concise even if the method bodies are large), and each of these methods can then be unit-tested independently.</p>
<p>Another advantage of class tasks is the ability to use multiple inheritance to specialize a task with multiple abstract base classes.
For example, I'd like to use the <a href="https://github.com/TrackMaven/celery-once/">celery_once</a> <code>QueueOnce</code> abstract class to introduce a locking mechanism, while still using <code>BaseTask</code> for Sentry logging. This way, each abstract task class is used as a mixin, adding some behaviour to the task.</p>
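<p>Stripped of Celery specifics, this mixin composition is plain Python multiple inheritance: each class in the MRO contributes its piece of behaviour, then delegates to the next class with <code>super()</code>. Here is a minimal sketch, with hypothetical stand-ins for <code>QueueOnce</code>, the Sentry-logging <code>BaseTask</code> and <code>app.Task</code>:</p>

```python
events = []  # records which behaviour ran, and in which order

class SentryMixin:
    """Hypothetical stand-in for the Sentry-logging BaseTask."""
    def on_failure(self, exc):
        events.append("sentry")
        super().on_failure(exc)  # hand over to the next class in the MRO

class LockingMixin:
    """Hypothetical stand-in for celery_once's QueueOnce lock handling."""
    def on_failure(self, exc):
        events.append("unlock")
        super().on_failure(exc)

class Task:
    """Hypothetical stand-in for celery's app.Task."""
    def on_failure(self, exc):
        events.append("base")

class handle_event(SentryMixin, LockingMixin, Task):
    """Both mixins contribute their on_failure behaviour via the MRO."""

handle_event().on_failure(ValueError("boom"))
print(events)  # ['sentry', 'unlock', 'base']
```

<p>The same cooperative <code>super()</code> chain is what lets several abstract task classes stack their handlers onto a single task.</p>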
<h2 id="unit-test-your-tasks">Unit-test your tasks</h2>
<p>Unit testing a project involving celery has always been a pickle for me. I tried to deploy a broker and a test celery worker in the CI environment, but it felt like killing a fly with a bazooka. The answer turns out to be quite simple, thanks to Nicolas Le Manchet for figuring this one out! When the <a href="https://celery.readthedocs.org/en/latest/configuration.html#celery-always-eager"><code>CELERY_ALWAYS_EAGER</code></a> option is activated, all tasks called using their <code>apply_async</code> or <code>delay</code> method are called <em>directly</em>, without requiring any broker or celery worker. Easy as pie.</p>My n-step plan to become a better programmer2015-05-24T00:00:00+02:002015-05-24T00:00:00+02:00Balthazar Rouberoltag:blog.balthazar-rouberol.com,2015-05-24:/my-n-step-plan-to-become-a-better-programmer<p>One of the main selling points of Python is its multi-paradigm philosophy. You can code in imperative, object-oriented or aspect-oriented style, use meta-programming techniques, etc. It also has an immense number of libraries available. Finally, it's both a simple language to pick up for beginners, and a powerful language for …</p><p>One of the main selling points of Python is its multi-paradigm philosophy. You can code in imperative, object-oriented or aspect-oriented style, use meta-programming techniques, etc. It also has an immense number of libraries available. Finally, it's both a simple language to pick up for beginners, and a powerful language for more experienced programmers.</p>
<p>I've been programming for the last six or seven years, and I feel that my main strength is also my main weakness: I've been mainly coding in Python since the beginning. It means that I can now use Python's features and standard library pretty well, but it also means that I tend to think of every problem in terms of Python features and libraries (standard or not).</p>
<p>A proverb programmers are taught quite early is</p>
<blockquote>
<p>If all you have is a hammer, everything looks like a nail.
(<a href="https://en.wiktionary.org/wiki/if_all_you_have_is_a_hammer,_everything_looks_like_a_nail">Source</a>)</p>
</blockquote>
<p>It means that if you're only comfortable with a single tool, then you'll try to use it in every situation, even in one where it's not appropriate. I strongly feel that to become a better programmer, I now need to learn other programming languages and even other paradigms. I was initially thinking of functional languages, like <a href="https://www.haskell.org/">Haskell</a> or <a href="http://ocaml.org/">OCaml</a>, but then I remembered something <a href="http://blaag.haard.se/">Fredrik</a> told me a while ago, at a EuroPython after-party: reading "Structure and Interpretation of Computer Programs" immediately made him a better programmer. I remember being curious as to why.</p>
<p>It so happens that the book is written under a Creative Commons license, and can be downloaded <a href="https://github.com/ieure/sicp/downloads">here</a>, AND uses <a href="https://en.wikipedia.org/wiki/Scheme_%28programming_language%29">Scheme</a> as a teaching language. It thus combines three things I strive for: a new language, a new programming paradigm and more insight into the art of programming itself.</p>
<p>I'm thus laying out my n-step plan to become a better programmer:</p>
<ol>
<li>Read the book thoroughly</li>
<li>Solve the exercises</li>
<li>Stop conceiving every solution in Python</li>
</ol>
<p>Behold, one of my first Scheme programs, a paving stone on the road to improvement.</p>
<div class="highlight"><pre><span></span><code><span class="c1">; Implementation of cubic root Newton approximation technique in Scheme</span>
<span class="p">(</span><span class="k">define</span><span class="w"> </span><span class="p">(</span><span class="nf">square</span><span class="w"> </span><span class="nv">x</span><span class="p">)</span><span class="w"> </span><span class="p">(</span><span class="nb">*</span><span class="w"> </span><span class="nv">x</span><span class="w"> </span><span class="nv">x</span><span class="p">))</span>
<span class="p">(</span><span class="k">define</span><span class="w"> </span><span class="p">(</span><span class="nf">cubic-root</span><span class="w"> </span><span class="nv">x</span><span class="p">)</span>
<span class="w"> </span><span class="p">(</span><span class="k">define</span><span class="w"> </span><span class="p">(</span><span class="nf">improve</span><span class="w"> </span><span class="nv">guess</span><span class="p">)</span>
<span class="w"> </span><span class="p">(</span><span class="nb">/</span><span class="w"> </span><span class="p">(</span><span class="nb">+</span><span class="w"> </span><span class="p">(</span><span class="nb">/</span><span class="w"> </span><span class="nv">x</span><span class="w"> </span><span class="p">(</span><span class="nf">square</span><span class="w"> </span><span class="nv">guess</span><span class="p">))</span><span class="w"> </span><span class="p">(</span><span class="nb">*</span><span class="w"> </span><span class="nv">guess</span><span class="w"> </span><span class="mi">2</span><span class="p">))</span><span class="w"> </span><span class="mi">3</span><span class="p">))</span>
<span class="w"> </span><span class="p">(</span><span class="k">define</span><span class="w"> </span><span class="p">(</span><span class="nf">good-enough?</span><span class="w"> </span><span class="nv">new-guess</span><span class="w"> </span><span class="nv">old-guess</span><span class="p">)</span>
<span class="w"> </span><span class="p">(</span><span class="nb"><</span><span class="w"> </span><span class="p">(</span><span class="nb">abs</span><span class="w"> </span><span class="p">(</span><span class="nb">/</span><span class="w"> </span><span class="p">(</span><span class="nb">-</span><span class="w"> </span><span class="nv">new-guess</span><span class="w"> </span><span class="nv">old-guess</span><span class="p">)</span><span class="w"> </span><span class="nv">old-guess</span><span class="p">))</span><span class="w"> </span><span class="mf">0.001</span><span class="p">))</span>
<span class="w"> </span><span class="p">(</span><span class="k">define</span><span class="w"> </span><span class="p">(</span><span class="nf">try</span><span class="w"> </span><span class="nv">new-guess</span><span class="w"> </span><span class="nv">old-guess</span><span class="p">)</span>
<span class="w"> </span><span class="p">(</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nf">good-enough?</span><span class="w"> </span><span class="nv">new-guess</span><span class="w"> </span><span class="nv">old-guess</span><span class="p">)</span>
<span class="w"> </span><span class="nv">new-guess</span>
<span class="w"> </span><span class="p">(</span><span class="nf">try</span><span class="w"> </span><span class="p">(</span><span class="nf">improve</span><span class="w"> </span><span class="nv">new-guess</span><span class="p">)</span><span class="w"> </span><span class="nv">new-guess</span><span class="p">)))</span>
<span class="w"> </span><span class="p">(</span><span class="nf">try</span><span class="w"> </span><span class="mf">1.0</span><span class="w"> </span><span class="nv">x</span><span class="p">)</span>
<span class="p">)</span>
<span class="p">(</span><span class="nf">cubic-root</span><span class="w"> </span><span class="mi">9</span><span class="p">)</span>
<span class="c1">; => 2.0800838232385224</span>
</code></pre></div>
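<p>For comparison, a rough Python translation of the same Newton iteration (my hammer, so to speak) could look like this:</p>

```python
def cubic_root(x, tolerance=0.001):
    """Approximate the cube root of x with Newton's method."""
    def improve(guess):
        return (x / (guess * guess) + guess * 2) / 3

    def good_enough(new_guess, old_guess):
        # stop when the relative change between two guesses is small enough
        return abs((new_guess - old_guess) / old_guess) < tolerance

    old_guess, new_guess = x, 1.0
    while not good_enough(new_guess, old_guess):
        old_guess, new_guess = new_guess, improve(new_guess)
    return new_guess

print(cubic_root(9))  # ~2.08008, the same result as the Scheme version
```

<p>The Scheme version expresses the loop as a recursive <code>try</code> procedure; the idiomatic Python rendering uses an explicit <code>while</code> loop instead.</p>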
<p>Note: If you want to experiment with various languages (Scheme included) without having to install them on your machine, have a look at <a href="http://repl.it/languages">repl.it</a>.</p>Crawl a website with scrapy2012-04-23T00:00:00+02:002012-04-23T00:00:00+02:00Balthazar Rouberoltag:blog.balthazar-rouberol.com,2012-04-23:/crawl-a-website-with-scrapy<p>In this article, we are going to see how to scrape information from a website, in particular, from all pages with a common URL pattern. We will see how to do that with <a href="http://scrapy.org/">Scrapy</a>, a very powerful, and yet simple, scraping and web-crawling framework.</p>
<p>For example, you might be interested …</p><p>In this article, we are going to see how to scrape information from a website, in particular, from all pages with a common URL pattern. We will see how to do that with <a href="http://scrapy.org/">Scrapy</a>, a very powerful, and yet simple, scraping and web-crawling framework.</p>
<p>For example, you might be interested in scraping information about each article of a blog, and storing that information in a database. To achieve such a thing, we will see how to implement a simple <a href="https://en.wikipedia.org/wiki/Web_crawler">spider</a> using <a href="http://scrapy.org/">Scrapy</a>, which will crawl the blog and store the extracted data into a <a href="http://www.mongodb.org/">MongoDB</a> database.</p>
<p>We will consider that you have a <a href="http://www.mongodb.org/display/DOCS/Quickstart">working MongoDB server</a>, and that you have installed the <code>pymongo</code> and <code>scrapy</code> python packages, both installable with <a href="http://pypi.python.org/pypi/pip"><code>pip</code></a>.</p>
<p>If you have never toyed around with <a href="http://scrapy.org/">Scrapy</a>, you should first read this <a href="http://doc.scrapy.org/en/latest/intro/tutorial.html">short tutorial</a>.</p>
<h2 id="first-step-identify-the-url-patterns">First step, identify the URL pattern(s)</h2>
<p>In this example, we’ll see how to extract the following information from each <a href="http://isbullsh.it">isbullsh.it</a> blogpost:</p>
<ul>
<li>title</li>
<li>author</li>
<li>tag</li>
<li>release date</li>
<li>url</li>
</ul>
<p>We’re lucky, all posts have the same URL pattern: <code>http://isbullsh.it/YYYY/MM/title</code>. These links can be found on the various pages of the site’s homepage.</p>
<p>What we need is a spider which will follow all links following this pattern, scrape the required information from the target webpage, validate the data integrity, and populate a MongoDB collection.</p>
<h2 id="building-the-spider">Building the spider</h2>
<p>We create a Scrapy project, following the instructions from their <a href="http://doc.scrapy.org/en/latest/intro/tutorial.html">tutorial</a>. We obtain the following project structure:</p>
<div class="highlight"><pre><span></span><code>isbullshit_scraping/
├── isbullshit
│ ├── __init__.py
│ ├── items.py
│ ├── pipelines.py
│ ├── settings.py
│ └── spiders
│ ├── __init__.py
│ ├── isbullshit_spiders.py
└── scrapy.cfg
</code></pre></div>
<p>We begin by defining, in <code>items.py</code>, the item structure which will contain the extracted information:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">scrapy.item</span> <span class="kn">import</span> <span class="n">Item</span><span class="p">,</span> <span class="n">Field</span>
<span class="k">class</span> <span class="nc">IsBullshitItem</span><span class="p">(</span><span class="n">Item</span><span class="p">):</span>
<span class="n">title</span> <span class="o">=</span> <span class="n">Field</span><span class="p">()</span>
<span class="n">author</span> <span class="o">=</span> <span class="n">Field</span><span class="p">()</span>
<span class="n">tag</span> <span class="o">=</span> <span class="n">Field</span><span class="p">()</span>
<span class="n">date</span> <span class="o">=</span> <span class="n">Field</span><span class="p">()</span>
<span class="n">link</span> <span class="o">=</span> <span class="n">Field</span><span class="p">()</span>
</code></pre></div>
<p>Now, let’s implement our spider, in <code>isbullshit_spiders.py</code>:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">scrapy.contrib.spiders</span> <span class="kn">import</span> <span class="n">CrawlSpider</span><span class="p">,</span> <span class="n">Rule</span>
<span class="kn">from</span> <span class="nn">scrapy.contrib.linkextractors.sgml</span> <span class="kn">import</span> <span class="n">SgmlLinkExtractor</span>
<span class="kn">from</span> <span class="nn">scrapy.selector</span> <span class="kn">import</span> <span class="n">HtmlXPathSelector</span>
<span class="kn">from</span> <span class="nn">isbullshit.items</span> <span class="kn">import</span> <span class="n">IsBullshitItem</span>
<span class="k">class</span> <span class="nc">IsBullshitSpider</span><span class="p">(</span><span class="n">CrawlSpider</span><span class="p">):</span>
<span class="n">name</span> <span class="o">=</span> <span class="s1">'isbullshit'</span>
<span class="n">start_urls</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'http://isbullsh.it'</span><span class="p">]</span> <span class="c1"># urls from which the spider will start crawling</span>
<span class="n">rules</span> <span class="o">=</span> <span class="p">[</span><span class="n">Rule</span><span class="p">(</span><span class="n">SgmlLinkExtractor</span><span class="p">(</span><span class="n">allow</span><span class="o">=</span><span class="p">[</span><span class="sa">r</span><span class="s1">'page/\d+'</span><span class="p">]),</span> <span class="n">follow</span><span class="o">=</span><span class="kc">True</span><span class="p">),</span>
<span class="c1"># r'page/\d+' : regular expression for http://isbullsh.it/page/X URLs</span>
<span class="n">Rule</span><span class="p">(</span><span class="n">SgmlLinkExtractor</span><span class="p">(</span><span class="n">allow</span><span class="o">=</span><span class="p">[</span><span class="sa">r</span><span class="s1">'\d</span><span class="si">{4}</span><span class="s1">/\d</span><span class="si">{2}</span><span class="s1">/\w+'</span><span class="p">]),</span> <span class="n">callback</span><span class="o">=</span><span class="s1">'parse_blogpost'</span><span class="p">)]</span>
<span class="c1"># r'\d{4}/\d{2}/\w+' : regular expression for http://isbullsh.it/YYYY/MM/title URLs</span>
<span class="k">def</span> <span class="nf">parse_blogpost</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">response</span><span class="p">):</span>
<span class="o">...</span>
</code></pre></div>
<p>Our spider inherits from <code>CrawlSpider</code>, which “provides a convenient mechanism for following links by defining a set of rules”. More info <a href="http://readthedocs.org/docs/scrapy/en/0.14/topics/spiders.html#crawlspider">here</a>.</p>
<p>We then define two simple rules:</p>
<ul>
<li>Follow links pointing to <code>http://isbullsh.it/page/X</code></li>
<li>Extract information from pages defined by a URL of pattern <code>http://isbullsh.it/YYYY/MM/title</code>, using the callback method <code>parse_blogpost</code>.</li>
</ul>
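<p>Before wiring them into the spider, both regular expressions can be sanity-checked in isolation with the standard <code>re</code> module (the URLs below are illustrative):</p>

```python
import re

# r'page/\d+': pagination links, e.g. http://isbullsh.it/page/2
page_pattern = re.compile(r'page/\d+')
# r'\d{4}/\d{2}/\w+': blogpost permalinks, e.g. http://isbullsh.it/2012/04/some-title
post_pattern = re.compile(r'\d{4}/\d{2}/\w+')

assert page_pattern.search('http://isbullsh.it/page/2')
assert post_pattern.search('http://isbullsh.it/2012/04/Web-crawling-with-scrapy')
# URLs matching neither rule are simply not followed
assert not post_pattern.search('http://isbullsh.it/about')
print("both rules match the expected URLs")
```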
<h2 id="extracting-the-data">Extracting the data</h2>
<p>To extract the title, author, etc, from the HTML code, we’ll use the <code>scrapy.selector.HtmlXPathSelector</code> object, which uses the <code>libxml2</code> HTML parser. If you’re not familiar with this object, you should read the <code>XPathSelector</code> <a href="http://readthedocs.org/docs/scrapy/en/0.14/topics/selectors.html#using-selectors-with-xpaths">documentation</a>.</p>
<p>We’ll now define the extraction logic in the <code>parse_blogpost</code> method (I’ll only define it for the title and tag(s), it’s pretty much always the same logic):</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">parse_blogpost</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">response</span><span class="p">):</span>
<span class="n">hxs</span> <span class="o">=</span> <span class="n">HtmlXPathSelector</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>
<span class="n">item</span> <span class="o">=</span> <span class="n">IsBullshitItem</span><span class="p">()</span>
<span class="c1"># Extract title</span>
<span class="n">item</span><span class="p">[</span><span class="s1">'title'</span><span class="p">]</span> <span class="o">=</span> <span class="n">hxs</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">'//header/h1/text()'</span><span class="p">)</span><span class="o">.</span><span class="n">extract</span><span class="p">()</span> <span class="c1"># XPath selector for title</span>
<span class="c1"># Extract tag(s)</span>
<span class="n">item</span><span class="p">[</span><span class="s1">'tag'</span><span class="p">]</span> <span class="o">=</span> <span class="n">hxs</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s2">"//header/div[@class='post-data']/p/a/text()"</span><span class="p">)</span><span class="o">.</span><span class="n">extract</span><span class="p">()</span> <span class="c1"># Xpath selector for tag(s)</span>
<span class="o">...</span>
<span class="k">return</span> <span class="n">item</span>
</code></pre></div>
<p><strong>Note</strong>: To be sure of the XPath selectors you define, I’d advise you to use Firebug, Firefox Inspect, or equivalent, to inspect the HTML code of a page, and then test the selector in a <a href="http://doc.scrapy.org/en/latest/intro/tutorial.html#trying-selectors-in-the-shell">Scrapy shell</a>. This only works if the data layout is consistent across all the pages you crawl.</p>
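<p>Outside of a Scrapy shell, a selector can also be sanity-checked against a saved HTML snippet. Below is a sketch using the standard library's <code>xml.etree.ElementTree</code> on a hypothetical, well-formed fragment; Scrapy's <code>libxml2</code>-based selectors accept the same kind of path but are far more tolerant of real-world HTML:</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet mimicking the blog's markup
snippet = """
<article>
  <header>
    <h1>My blogpost title</h1>
    <div class="post-data"><p><a>scrapy</a><a>python</a></p></div>
  </header>
</article>
"""

root = ET.fromstring(snippet)
# Rough stdlib equivalents of the article's two selectors
title = root.find(".//header/h1").text
tags = [a.text for a in root.findall(".//header/div[@class='post-data']/p/a")]
print(title, tags)  # My blogpost title ['scrapy', 'python']
```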
<h2 id="store-the-results-in-mongodb">Store the results in MongoDB</h2>
<p>Each time the <code>parse_blogpost</code> method returns an item, we want it to be sent to a pipeline which will validate the data, and store everything in our Mongo collection.</p>
<p>First, we need to add a couple of things to <code>settings.py</code>:</p>
<div class="highlight"><pre><span></span><code><span class="n">ITEM_PIPELINES</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'isbullshit.pipelines.MongoDBPipeline'</span><span class="p">,]</span>
<span class="n">MONGODB_SERVER</span> <span class="o">=</span> <span class="s2">"localhost"</span>
<span class="n">MONGODB_PORT</span> <span class="o">=</span> <span class="mi">27017</span>
<span class="n">MONGODB_DB</span> <span class="o">=</span> <span class="s2">"isbullshit"</span>
<span class="n">MONGODB_COLLECTION</span> <span class="o">=</span> <span class="s2">"blogposts"</span>
</code></pre></div>
<p>Now that we’ve defined our pipeline, our MongoDB database and collection, we’re just left with the pipeline implementation. We just want to be sure that we do not have any missing data (ex: a blogpost without a title, author, etc).</p>
<p>Here is our <code>pipelines.py</code> file:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pymongo</span>
<span class="kn">from</span> <span class="nn">scrapy.exceptions</span> <span class="kn">import</span> <span class="n">DropItem</span>
<span class="kn">from</span> <span class="nn">scrapy.conf</span> <span class="kn">import</span> <span class="n">settings</span>
<span class="kn">from</span> <span class="nn">scrapy</span> <span class="kn">import</span> <span class="n">log</span>
<span class="k">class</span> <span class="nc">MongoDBPipeline</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">connection</span> <span class="o">=</span> <span class="n">pymongo</span><span class="o">.</span><span class="n">Connection</span><span class="p">(</span><span class="n">settings</span><span class="p">[</span><span class="s1">'MONGODB_SERVER'</span><span class="p">],</span> <span class="n">settings</span><span class="p">[</span><span class="s1">'MONGODB_PORT'</span><span class="p">])</span>
<span class="n">db</span> <span class="o">=</span> <span class="n">connection</span><span class="p">[</span><span class="n">settings</span><span class="p">[</span><span class="s1">'MONGODB_DB'</span><span class="p">]]</span>
<span class="bp">self</span><span class="o">.</span><span class="n">collection</span> <span class="o">=</span> <span class="n">db</span><span class="p">[</span><span class="n">settings</span><span class="p">[</span><span class="s1">'MONGODB_COLLECTION'</span><span class="p">]]</span>
<span class="k">def</span> <span class="nf">process_item</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">item</span><span class="p">,</span> <span class="n">spider</span><span class="p">):</span>
<span class="n">valid</span> <span class="o">=</span> <span class="kc">True</span>
<span class="k">for</span> <span class="n">data</span> <span class="ow">in</span> <span class="n">item</span><span class="p">:</span>
<span class="c1"># here we only check if the data is not null</span>
<span class="c1"># but we could do any crazy validation we want</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">item</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">data</span><span class="p">):</span>
<span class="n">valid</span> <span class="o">=</span> <span class="kc">False</span>
<span class="k">raise</span> <span class="n">DropItem</span><span class="p">(</span><span class="s2">"Missing </span><span class="si">%s</span><span class="s2"> of blogpost from </span><span class="si">%s</span><span class="s2">"</span> <span class="o">%</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">item</span><span class="p">[</span><span class="s1">'link'</span><span class="p">]))</span>
<span class="k">if</span> <span class="n">valid</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">collection</span><span class="o">.</span><span class="n">insert</span><span class="p">(</span><span class="nb">dict</span><span class="p">(</span><span class="n">item</span><span class="p">))</span>
<span class="n">log</span><span class="o">.</span><span class="n">msg</span><span class="p">(</span><span class="s2">"Item wrote to MongoDB database </span><span class="si">%s</span><span class="s2">/</span><span class="si">%s</span><span class="s2">"</span> <span class="o">%</span>
<span class="p">(</span><span class="n">settings</span><span class="p">[</span><span class="s1">'MONGODB_DB'</span><span class="p">],</span> <span class="n">settings</span><span class="p">[</span><span class="s1">'MONGODB_COLLECTION'</span><span class="p">]),</span>
<span class="n">level</span><span class="o">=</span><span class="n">log</span><span class="o">.</span><span class="n">DEBUG</span><span class="p">,</span> <span class="n">spider</span><span class="o">=</span><span class="n">spider</span><span class="p">)</span>
<span class="k">return</span> <span class="n">item</span>
</code></pre></div>
<h2 id="release-the-spider">Release the spider!</h2>
<p>Now, all we have to do is change directory to the root of our project and execute</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>scrapy<span class="w"> </span>crawl<span class="w"> </span>isbullshit
</code></pre></div>
<p>The spider will then follow all links pointing to a blogpost, retrieve the post title, author name, date, etc, validate the extracted data, and store all that in a MongoDB collection if validation went well.</p>
<p>Pretty neat, hm?</p>
<h2 id="conclusion">Conclusion</h2>
<p>This case is pretty simplistic: all URLs have a similar pattern and all links are hard-written in the HTML code: there is no JS involved. In the case where the links you want to reach are generated by JS, you’d probably want to check out <a href="http://pypi.python.org/pypi/selenium">Selenium</a>. You could make the spider more sophisticated by adding new rules or more complicated regular expressions, but I just wanted to demo how Scrapy works, not get into crazy regex explanations.</p>
<p>Also, be aware that sometimes, there’s a thin line between playing with web-scraping and <a href="https://en.wikipedia.org/wiki/Web_scraping#Legal_issues">getting into trouble</a>.</p>
<p>Finally, when toying with web-crawling, keep in mind that you might just flood the server with requests, which can sometimes get you IP-blocked :)</p>
<p>The entire code of this project is hosted on <a href="https://github.com/BaltoRouberol/isbullshit-crawler">Github</a>. Help yourselves!</p>Create a webcam manager using pyGTK and Gstreamer2012-02-29T00:00:00+01:002012-02-29T00:00:00+01:00Balthazar Rouberoltag:blog.balthazar-rouberol.com,2012-02-29:/create-a-webcam-manager-using-pygtk-and-gstreamer<h2 id="introduction">Introduction</h2>
<p>I recently joined the <a href="http://www.strongsteam.com">Strongsteam</a> project for a 6 month internship. Our main goal is to provide some <em>"artificial intelligence and
data mining APIs to let you pull interesting information out of images, video and audio."</em>
We will be doing a presentation at <a href="https://us.pycon.org/2012/">Pycon 2012</a>, the 9th of March …</p><h2 id="introduction">Introduction</h2>
<p>I recently joined the <a href="http://www.strongsteam.com">Strongsteam</a> project for a 6 month internship. Our main goal is to provide some <em>"artificial intelligence and
data mining APIs to let you pull interesting information out of images, video and audio."</em>
We will be doing a presentation at <a href="https://us.pycon.org/2012/">Pycon 2012</a>, the 9th of March, during the <a href="https://us.pycon.org/2012/community/startuprow/">Startup Row weekend</a>.
On this occasion, I had to implement a desktop GUI to display a webcam video stream and capture snapshots, with the following constraints:</p>
<ul>
<li>GUI written with <a href="http://wxpython.org/">wxPython</a> or <a href="http://www.pygtk.org/">pyGTK</a></li>
<li>the webcam stream must be integrated in the wxPython/pyGTK window</li>
<li>the webcam must not be handled with the <a href="http://opencv.willowgarage.com/wiki/PythonInterface">OpenCV</a> python module (the installation can be painful on Mac OS X)</li>
<li>the snapshots default format and resolution must be JPG and 640x480px</li>
</ul>
<h2 id="how-to-handle-the-webcam">How to handle the webcam?</h2>
<p>My initial research led me to consider two different solutions:</p>
<ul>
<li>using <a href="http://www.pygame.org">PyGame</a>, a set of python modules adding functionality on top of the <a href="http://www.libsdl.org/">SDL</a> library</li>
<li>using <a href="http://gstreamer.freedesktop.org/">Gstreamer</a>, a pipeline-based multimedia framework allowing <em>"to create a variety of media-handling components, including simple audio playback, audio and video playback, recording, streaming and editing"</em> (quote: wikipedia article). Gstreamer is used by a bunch of multimedia applications, like <a href="https://live.gnome.org/Cheese">Cheese</a>, <a href="http://amarok.kde.org/">Amarok</a>, <a href="http://pitivi.sourceforge.net/">Pitivi</a>, ...</li>
</ul>
<p>I quickly turned to PyGame, because of the simplicity of the snapshot operation: all we have to do is call the <a href="http://www.pygame.org/docs/ref/camera.html#pygame.camera.Camera.get_image"><code>pygame.camera.Camera.get_image()</code></a> function. However, integrating the PyGame surface into a pyGTK interface turned out to be pretty complicated. I found a couple of <a href="http://stackoverflow.com/questions/25661/pygame-within-a-pygtk-application">StackOverflow posts</a> stating that even though this integration was possible, it was not advised: erratic behaviour has been observed on different operating systems.</p>
<p>I thus considered Gstreamer, and quickly found this <a href="http://pygstdocs.berlios.de/pygst-tutorial/webcam-viewer.html">encouraging project</a>. Its code made it possible to start and stop a webcam video stream embedded in a pyGTK interface: I was definitely in the right place!</p>
<h2 id="why-doesnt-it-work-with-my-webcam">Why doesn't it work with my webcam?</h2>
<p>If you experience problems testing the project introduced in the previous section (black screen, or a successful first run followed by black screens on subsequent runs), check whether your webcam is UVC (USB Video Class) compliant under Linux. To do so, run</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span>lsusb
</code></pre></div>
<p>in a terminal and locate the line describing your webcam.</p>
<p>My laptop integrated webcam was described as <code>Bus 001 Device 003: ID 05ca:1814 Ricoh Co., Ltd HD Webcam</code>. The reference <code>05ca:1814</code> doesn't appear on the <a href="http://www.ideasonboard.org/uvc/">UVC</a> website. That could explain why I experienced so many problems with it (it appears that Ricoh webcams are poorly UVC compliant).</p>
<p>I hence bought a Logitech QuickCam Pro 9000, known for being well supported. Everything ran smoothly with this one.</p>
<h2 id="how-to-use-gstreamer">How to use Gstreamer?</h2>
<p>If you don't know how to use Gstreamer, I'd advise you to have a look at these pages:</p>
<ul>
<li><a href="http://wiki.oz9aec.net/index.php/Gstreamer_Cheat_Sheet">Gstreamer cheat sheet</a></li>
<li><a href="http://www.oz9aec.net/index.php/gstreamer/345-a-weekend-with-gstreamer">A weekend with Gstreamer</a></li>
</ul>
<p>The main idea is to construct a <strong>pipeline</strong>, by connecting various data sources, sinks and processing blocks (bins) in a data flow graph.</p>
<p>In our case, we are going to use the following pipeline to display the webcam stream:</p>
<blockquote>
<p><code>v4l2src ! video/x-raw-yuv,width=640,height=480,framerate=30/1 ! xvimagesink</code></p>
</blockquote>
<ul>
<li><code>v4l2src</code>: Video4Linux2 source, i.e. your webcam (the default device is <code>/dev/video0</code>; if you are using an external webcam, use <code>v4l2src device=/dev/video1</code>)</li>
<li><code>video/x-raw-yuv</code>: raw YUV video, the colorspace typically output by webcams</li>
<li><code>width=640,height=480</code>: the capture resolution (check that it is supported by your webcam)</li>
<li><code>framerate=30/1</code>: number of frames per second</li>
<li><code>xvimagesink</code>: video sink</li>
</ul>
<p>Let's see how to do that in Python:</p>
<div class="highlight"><pre><span></span><code>def create_video_pipeline(self):
    """Set up the video pipeline and the communication bus between the video stream and the gtk DrawingArea"""
    video_pipeline = 'v4l2src device=/dev/video1 ! video/x-raw-yuv,width=640,height=480,framerate=30/1 ! xvimagesink'
    self.video_player = gst.parse_launch(video_pipeline)  # create pipeline
    self.video_player.set_state(gst.STATE_PLAYING)        # start video stream
    bus = self.video_player.get_bus()
    bus.add_signal_watch()
    bus.connect("message", self.on_message)
    bus.enable_sync_message_emission()
    bus.connect("sync-message::element", self.on_sync_message)

def on_message(self, bus, message):
    """Gst message bus. Closes the pipeline in case of error or end of stream message"""
    t = message.type
    if t == gst.MESSAGE_EOS:
        print "MESSAGE EOS"
        self.video_player.set_state(gst.STATE_NULL)
    elif t == gst.MESSAGE_ERROR:
        print "MESSAGE ERROR"
        err, debug = message.parse_error()
        print "Error: %s" % err, debug
        self.video_player.set_state(gst.STATE_NULL)

def on_sync_message(self, bus, message):
    """Set up the Webcam &lt;--&gt; GUI messages bus"""
    if message.structure is None:
        return
    message_name = message.structure.get_name()
    if message_name == "prepare-xwindow-id":
        # Assign the viewport
        imagesink = message.src
        imagesink.set_property("force-aspect-ratio", True)
        # Send the video stream to the gtk DrawingArea
        imagesink.set_xwindow_id(self.movie_window.window.xid)
</code></pre></div>
<p>Now, we have a live video stream displayed in a pyGTK interface, but still no way of capturing a snapshot.</p>
<h2 id="how-do-we-capture-a-snapshot">How do we capture a snapshot?</h2>
<p>I came across many open StackOverflow questions about this part, but no satisfactory answer...</p>
<p>At first, I wanted to use Gstreamer for that too, but I couldn't find any way to dynamically modify the pipeline to add a frame extraction, jpg encoding and a filesink (to save the snapshot). I thus tried this ugly hack: when the <em>'take snapshot'</em> button is clicked</p>
<ul>
<li>stop the video stream</li>
<li>start the following pipeline: <code>v4l2src device=/dev/video1 ! video/x-raw-yuv,width=640,height=480,framerate=30/1 ! ffmpegcolorspace ! video/x-raw-rgb,framerate=1/1 ! ffmpegcolorspace ! jpegenc snapshot=true ! filesink location=snap.jpeg</code>, which will extract a single frame, encode it to jpg and save it to a file.</li>
<li>stop this image pipeline</li>
<li>re-start the video stream</li>
</ul>
<p>That was of course ugly, and resulted in a ~2s flicker when taking the snapshot... Back to square one.</p>
<p>I'll spare you the suspense: the right solution is to use the <code>gtk.DrawingArea.window.get_colormap()</code> method, as shown here:</p>
<div class="highlight"><pre><span></span><code>def take_snapshot(self):
    """Capture a snapshot from the DrawingArea and save it into an image file"""
    drawable = self.movie_window.window
    # self.movie_window is of type gtk.DrawingArea()
    colormap = drawable.get_colormap()
    pixbuf = gtk.gdk.Pixbuf(gtk.gdk.COLORSPACE_RGB, 0, 8, *drawable.get_size())
    pixbuf = pixbuf.get_from_drawable(drawable, colormap, 0, 0, 0, 0, *drawable.get_size())
    # We resize from the actual window size to the wanted resolution
    # gtk.gdk.INTERP_HYPER is the slowest and highest quality reconstruction function
    # More info here: http://developer.gnome.org/pygtk/stable/class-gdkpixbuf.html#method-gdkpixbuf--scale-simple
    pixbuf = pixbuf.scale_simple(self.W, self.H, gtk.gdk.INTERP_HYPER)
    filename = 'snap.jpg'
    pixbuf.save(filename, self.snap_format)
</code></pre></div>
<p>This snippet does the following operations:</p>
<ul>
<li>extract the last frame from the <code>gtk.DrawingArea</code></li>
<li>encode it to RGB</li>
<li>resize it to 640x480px</li>
<li>save it to <code>snap.jpg</code></li>
</ul>
<p>And that's done, without even a teeny-tiny flicker! Yay! We now have a perfectly functional snapshot operation.</p>
<h2 id="project-source-code-git-repository">Project source code & Git repository</h2>
<p>All the code can be found on my <a href="https://github.com/BaltoRouberol/Gstreamer-webcam-tool">GitHub</a>.</p>How to randomly generate a Monty Python parody2011-11-16T00:00:00+01:002011-11-16T00:00:00+01:00Balthazar Rouberoltag:blog.balthazar-rouberol.com,2011-11-16:/how-to-randomly-generate-a-monty-python-parody<p>If you always wanted to write texts in the style of Monty Python, I have what you need!
In this post, I am going to show you mathematical techniques to analyse a text, in order to randomly generate look-alike texts.</p>
<h2 id="introduction-to-basic-concepts">Introduction to basic concepts</h2>
<p>First essential question: what is a …</p><p>If you always wanted to write texts in the style of Monty Python, I have what you need!
In this post, I am going to show you mathematical techniques to analyse a text, in order to randomly generate look-alike texts.</p>
<h2 id="introduction-to-basic-concepts">Introduction to basic concepts</h2>
<p>First essential question: what is a text?</p>
<p>From a mathematical point of view, a text of length <em>n</em> is simply the concatenation of <em>n</em> symbols, all taken from a finite alphabet <em>A</em>.
In our context, the alphabet is generally composed of all lowercase and uppercase letters, punctuation signs, etc.</p>
<p>In a real-life situation, <strong>the succession of symbols is not random, but depends on the previous symbols</strong>. Indeed, if the 3 last symbols are <em>" "</em>, <em>"t"</em> and <em>"h"</em>, it is highly probable that the next one will be <em>"e"</em>, because the word <em>"the"</em> is fairly common.</p>
<p>The whole problem thus boils down to obtaining a transition probability matrix between strings of fixed length and all symbols of the alphabet.</p>
<p><em>Example</em>: Let's assume that the three last symbols are <em>" "</em>, <em>"t"</em>, and <em>"h"</em>, and that the probability of the next symbol being <em>"e"</em> (written $p("e" / " th")$) is 0.6, that of <em>"a"</em> is 0.3 and that of <em>"u"</em> is 0.1.
We would then obtain one line of the transition probability matrix between <em>" th"</em> and all alphabet symbols:</p>
<p><em>" th"</em> —> a: 0.3, b: 0, c: 0, ..., e: 0.6, ..., u: 0.1, ...</p>
<p>The probability $p("e" / " th")$ is called a conditional probability.</p>
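<p>To make this concrete, here is a toy estimator of such a conditional probability from raw counts (an illustrative sketch in modern Python, not code from the original project):</p>

```python
def conditional_probability(text, context, symbol):
    """Estimate p(symbol | context) from raw (non-overlapping) counts."""
    n_context = text.count(context)
    if n_context == 0:
        return 0.0
    return text.count(context + symbol) / n_context

# "th" appears 6 times in the sample, followed by "e" 3 times
p = conditional_probability("the thin thug thanked the other", "th", "e")  # -> 0.5
```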
<h2 id="markov-chain-of-order-k">Markov chain of order $k$</h2>
<p>We are going to model our data text (here, the "Monty Python and the Holy Grail" script) with a Markov chain of order $k$. This barbaric name refers to:</p>
<blockquote>
<p>"a mathematical system that undergoes transitions from one state to another (from a finite or countable number of possible states) in a chain-like manner
--
<a href="http://en.wikipedia.org/wiki/Markov_chain" title="Wikipedia">Source : Wikipedia</a>"</p>
</blockquote>
<p>That means that the next state is conditioned by the $k$ previous ones.</p>
<p>If we deal with a Markov chain of order 3, the probability of occurrence of the next symbol will only depend on the 3 previous symbols. From previous tests, I can say that <strong>$k=10$ is a good place to start</strong>. (More on that later)</p>
<h2 id="text-alphabet">Text Alphabet</h2>
<p>We've just fixed the value of k, which was the first step of the process. Now, we need to create a list of all encountered symbols (i.e. the alphabet).</p>
<p>First, we read the data file, and join all the lines in a single string.</p>
<div class="highlight"><pre><span></span><code><span class="n">f</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'../data/monty.txt'</span><span class="p">)</span>
<span class="n">f_lines</span> <span class="o">=</span> <span class="s1">' '</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">readlines</span><span class="p">())</span>
</code></pre></div>
<p>Then, we create the alphabet list:</p>
<div class="highlight"><pre><span></span><code>def alphabet(datafile_lines):
    """
    Returns all used characters in a given text
    """
    alph = []
    for letter in datafile_lines:
        if letter not in alph:
            alph.append(letter)
    return sorted(alph)
</code></pre></div>
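<p>As an aside, since a Python set already deduplicates its elements, the same alphabet can be built in a single line (an equivalent shortcut, shown here in modern Python):</p>

```python
def alphabet(text):
    """Return the sorted list of distinct characters found in `text`."""
    return sorted(set(text))

# alphabet("banana") -> ['a', 'b', 'n']
```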
<h2 id="finding-all-exiting-k-tuples-in-the-source-text">Finding all existing k-tuples in the source text</h2>
<p>Now, <strong>we need to identify all distinct strings of length $k=10$ in the text</strong>.</p>
<p>This can seem a bit tedious, but list comprehensions and sets will do a lovely job.</p>
<div class="highlight"><pre><span></span><code># -- split text into all chunks of length k (one per starting position)
# Stopping at len(datafile_lines) - k + 1 avoids generating truncated
# chunks of size &lt; k at the end of the text
ak_chunks = [datafile_lines[i:i+k] for i in xrange(len(datafile_lines) - k + 1)]
# -- Extract unique values from list
ak_chunks = list(set(ak_chunks))  # set: reduce to unique values
</code></pre></div>
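<p>On a short string the chunking is easy to check by hand (a toy illustration with $k=2$ instead of 10, in modern Python):</p>

```python
k = 2
text = "ABCABD"
# every substring of length k, one per starting position
chunks = [text[i:i + k] for i in range(len(text) - k + 1)]
# chunks == ['AB', 'BC', 'CA', 'AB', 'BD']
unique_chunks = sorted(set(chunks))
# unique_chunks == ['AB', 'BC', 'BD', 'CA']
```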
<h2 id="empirical-probabilities-of-transition">Empirical probabilities of transition</h2>
<p>Now comes the hard work. So far, we have</p>
<ul>
<li>a text,</li>
<li>its alphabet,</li>
<li>a HUGE list of all distinct strings of length $k=10$ contained in the text</li>
</ul>
<p>What we then need is a way to calculate the empirical probability of transition between each string of length 10 and each symbol of the alphabet ("empirical" in the sense that these probabilities only apply to the text we study).</p>
<p>Let's formalize the problem a bit:</p>
<ul>
<li>$a^k$: a string of length $k$ (here, 10)</li>
<li>$b$: the symbol located right after $a^k$</li>
<li>$n_{a^k}$: the number of times the string $a^k$ is encountered in the text</li>
<li>$n_{b/a^k}$: the number of times the string $a^k$ is followed by the symbol $b$</li>
</ul>
<p>We can now express the empirical probability $p(b/a^k) = n_{b/a^k} / n_{a^k}$
(the number of times the string $a^k$ is followed by the symbol $b$, divided by the number of times the string $a^k$ is encountered in the text)</p>
<p><em>Example</em>: if our text is ABCABDABC, $a^k = AB$ and $b = C$:</p>
<ul>
<li>$n_{AB} = 3$</li>
<li>$n_{C/AB} = 2$</li>
<li>$p(C/AB) = 2/3 \approx 0.667$</li>
</ul>
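<p>These counts can be verified directly with <code>str.count</code> (which counts non-overlapping occurrences; that is fine for these patterns, since "AB" cannot overlap itself):</p>

```python
text = "ABCABDABC"
n_ak = text.count("AB")     # occurrences of a^k = "AB"
n_b_ak = text.count("ABC")  # occurrences of "AB" followed by "C"
p = n_b_ak / n_ak           # empirical p(C/AB)
```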
<p>Let's write all that in Python:</p>
<div class="highlight"><pre><span></span><code>def conditional_empirical_proba(chain, ak, symbol, n_ak):  # p(b/a^k)
    """
    Returns the proportion of symbols after the ak string (contained
    in chain string and of length k) which are equal to the value
    of given parameter 'symbol'
    Ex: conditional_empirical_proba('ABCABD', 'AB', 'C', 2) -> 0.5
    """
    nb_ak = n_b_ak(chain, ak, symbol)
    if n_ak != 0:
        return float(nb_ak) / n_ak
    else:
        return 0

def n_b_ak(chain, ak, symbol):  # n_(b/a^k)
    """
    Given a string chain, returns the number of
    times that a given symbol is found
    right after a string ak inside the chain
    """
    return chain.count(ak + symbol)

def n_ak(chain, ak):  # n_(a^k)
    """
    Given a string chain and a string ak, returns
    the number of times ak is found in chain
    """
    return chain.count(ak)
</code></pre></div>
<p>Now, the only remaining thing to do is to calculate the empirical conditional probability for each k-tuple and each symbol.</p>
<p>A few remarks are necessary:</p>
<ul>
<li>We will only store empirical conditional probabilities > 0 (more on that later)</li>
<li>We will store accumulative empirical conditional probabilities (more on that later)</li>
<li>The matrix will be created as a dictionary of dictionaries</li>
</ul>
<div class="highlight"><pre><span></span><code># Initialization of matrix
prob = {}
for ak in ak_chunks:
    # New matrix line
    prob[ak] = {}
    # number of occurrences of ak in the text
    nak = n_ak(datafile_lines, ak)
    # -- calculate p(b/a^k) for each symbol of alphabet
    pbak_cumul = 0
    for symb in alpha:
        pbak = conditional_empirical_proba(datafile_lines, ak, symb, nak)
        # cumulative probabilities
        pbak_cumul += pbak
        # if the succession ak+symb is encountered in the text, add its probability to the matrix
        if pbak != 0.0:  # Very important: if pbak = 0.0, the combination ak+symb must not be randomly generated
            prob[ak][symb] = pbak_cumul

with open('../results/distribs/distrib_k%d.txt' % (k), 'w') as proba_file:
    pickle.dump(prob, proba_file)
</code></pre></div>
<h2 id="random-text-generation">Random text generation</h2>
<p>Close your eyes for a second, and think about what we just did. <strong>We calculated empirical transition probabilities between all existing strings of length 10 and all symbols of the alphabet, and stored the non-nil cumulative probabilities in a matrix</strong>. (The non-nil part has two main advantages: it implies a lower storage cost, and we only store combinations that occurred in the text. This way, random generation becomes really easy!)</p>
<p>It is now extremely easy to generate a text using these accumulative probabilities! Let's consider a quick example.</p>
<p><em>Example</em>: $a^k = AB$, $p(A/AB)=0.2$, $p(B/AB)=0.5$, $p(C/AB)=0.3$. We then store these cumulative values in the matrix:</p>
<ul>
<li>$p(A/AB)=0.2$</li>
<li>$p(B/AB)=0.7$</li>
<li>$p(C/AB)=1$</li>
</ul>
<p>That way, we only have to pick a random float between 0 and 1 using a uniform distribution to match this float with a symbol. <code>random(0,1) = 0.678 --> symbol = B</code></p>
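<p>The lookup itself is just a walk through the stored cumulative values until one exceeds the random draw. A minimal sketch (in modern Python; note that the symbols must be scanned in the same alphabetical order used during accumulation):</p>

```python
def pick_symbol(cumulative, r):
    """Return the first symbol whose cumulative probability exceeds r."""
    for symbol in sorted(cumulative):  # same order as during accumulation
        if cumulative[symbol] > r:
            return symbol
    return None

# cumulative distribution for the "AB" example above
cumulative = {"A": 0.2, "B": 0.7, "C": 1.0}
# a draw of 0.678 falls in ]0.2, 0.7], so it maps to "B"
```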
<p>For this technique to work, the first $k=10$ symbols of the generated text must directly come from the original text (and hence will be contained in the matrix). This will give us a valid initial condition.</p>
<p>Let's now generate the text:</p>
<div class="highlight"><pre><span></span><code>def random_text(size, k):
    """
    Given a result size and an integer k,
    returns a randomly generated text using
    the probability distributions of Markov chains
    of order k dumped in the ../results/distribs/distrib_kX.txt
    files
    """
    # -- Initial string
    with open('../data/monty.txt', 'r') as f:
        initial_string = ' '.join(f.readlines())[:k]
    out = initial_string
    # -- Import probability distribution
    try:
        p = open('../results/distribs/distrib_k%d.txt' % (k), 'r')
    except IOError as err:
        print err
        exit(2)
    distrib_matrix = pickle.load(p)
    p.close()
    # -- Generate text following probability distribution
    kuple = initial_string
    for x in xrange(size):
        p = random.uniform(0, 1)
        char = ''
        # read distribution specific to k-tuple string
        dist = distrib_matrix[kuple]
        # scan the symbols in the same (alphabetical) order used
        # when accumulating the probabilities
        for symbol in sorted(dist):
            char = symbol
            if dist[symbol] > p:
                break
        out += char
        kuple = kuple[1:] + char  # update k-tuple
    return out
</code></pre></div>
<p>Done! Now, you only have to call the function <code>random_text(len_text, 10)</code> and BOOM!</p>
<h2 id="example-of-generated-text-with-k-10">Example of generated text with $k = 10$</h2>
<div class="highlight"><pre><span></span><code><span class="ss">"KING ARTHUR: Will you ask your master that we have been charged by God with a sacred quest. If he will give us food and shelter for the week.</span>
<span class="ss">ARTHUR: Will you ask your master if he wants to join my court at Camelot?!</span>
<span class="ss">SOLDIER #1: You're using coconuts!</span>
<span class="ss">ARTHUR: Ohh.</span>
<span class="ss">BEDEVERE: Uh, but you are wounded!</span>
<span class="ss">GALAHAD: What are you doing in England?</span>
<span class="ss">FRENCH GUARDS: [whispering] Forgive me that' and 'I'm not worth"</span>
</code></pre></div>
<h2 id="what-if-we-change-k">What if we change $k$ ?</h2>
<p>$k$ can be interpreted as the amount of context taken into account when computing a symbol's occurrence probability. We chose $k = 10$ because a context of 10 symbols allows the program to generate text that makes apparent sense (limited by the randomness of the process, and by the fact that THIS IS MONTY FREAKING PYTHON).</p>
<p>The more context you add, the more alike the generated and original texts will be, up to a point where they will be identical.</p>
<p>If you decrease $k$, you reach an interesting regime where the program still generates real words, but strings them together senselessly.</p>
<p>For example, with $k=5$:</p>
<div class="highlight"><pre><span></span><code><span class="ss">"KING ARTHUR: Yes!</span>
<span class="ss">VILLAGER #3: A bit.</span>
<span class="ss">VILLAGER #1: You saw saw saw it, did you could</span>
<span class="ss">separate, and master that!</span>
<span class="ss">ARTHUR: Will you on Thursday.</span>
<span class="ss">CUSTOMER: What do you can you think kill your every</span>
<span class="ss">good people. It's one.)</span>
<span class="ss">OTHER FRENCH GUARDS: [whispering]"</span>
</code></pre></div>
<p>If you decrease $k$ even more, you will only generate rubbish.</p>
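<p>To make the effect of $k$ easy to experiment with, here is a minimal, self-contained sketch of an order-$k$ generator in modern Python 3 (the helper names <code>build_distribution</code> and <code>generate</code> are mine, and the code differs from the Python 2 version shown above, which persists its distributions to disk):</p>

```python
import random
from collections import defaultdict


def build_distribution(text, k):
    """For each k-tuple in text, compute how often each symbol follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(text) - k):
        counts[text[i:i + k]][text[i + k]] += 1
    # Normalize the counts into probabilities.
    return {
        kuple: {sym: n / sum(nexts.values()) for sym, n in nexts.items()}
        for kuple, nexts in counts.items()
    }


def generate(text, k, size, seed=None):
    """Generate `size` symbols following the order-k distribution of `text`."""
    rng = random.Random(seed)
    dist = build_distribution(text, k)
    kuple = text[:k]  # seed the generator with the first k symbols
    out = kuple
    for _ in range(size):
        nexts = dist.get(kuple)
        if nexts is None:  # dead end: this k-tuple was never seen mid-text
            break
        symbols, weights = zip(*nexts.items())
        symbol = rng.choices(symbols, weights=weights)[0]
        out += symbol
        kuple = kuple[1:] + symbol  # slide the context window
    return out
```

<p>Raising $k$ makes each k-tuple map to fewer and fewer candidate successors, which is exactly why high-order output converges towards a verbatim copy of the original text.</p>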
<h2 id="conclusion">Conclusion</h2>
<p>We have seen a pretty simple text analysis technique that allows us to randomly generate text based on a statistical analysis of a source text. It relies on the fact that the probability of occurrence of a letter depends on its local "past".</p>
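<p>Formally, this is the order-$k$ Markov assumption (notation mine): $P(x_{n+1} \mid x_1 \dots x_n) = P(x_{n+1} \mid x_{n-k+1} \dots x_n)$, i.e. only the last $k$ symbols influence the next one.</p>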
<p>By playing with the value of this "past length", you can generate text more or less similar to the original, with more or less "sense".</p>
<p>This simple technique does not use the <code>nltk</code> Python module, nor a corpus of texts to derive "theoretical" rules about a language: it is purely empirical.</p>
<p>All source code available on <a href="https://github.com/brouberol/Generate-Monty-Pyhon-Dialog" title="GitHub repository">GitHub</a>.</p>
<p><strong>EDIT</strong>: A nice comment from reddit:</p>
<blockquote>
<p>"This approach was first proposed by Claude Shannon in his landmark paper "A Mathematical Theory of Communication"… in 1948.
Gotta love how people keep reinventing the same things over and over again. But this time, in Python!"</p>
</blockquote>