Starbeamrainbowlabs

Stardust
Blog


Archive


Mailing List Articles Atom Feed Comments Atom Feed Twitter Reddit Facebook

Tag Cloud

3d 3d printing account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blender blog bookmarklet booting bug hunting c sharp c++ challenge chrome os cluster code codepen coding conundrums coding conundrums evolved command line compilers compiling compression conference conferences containerisation css dailyprogrammer data analysis debugging defining ai demystification distributed computing dns docker documentation downtime electronics email embedded systems encryption es6 features ethics event experiment external first impressions freeside future game github github gist gitlab graphics guide hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs latex learning library linux lora low level lua maintenance manjaro minetest network networking nibriboard node.js open source operating systems optimisation outreach own your code pepperminty wiki performance phd photos php pixelbot portable privacy problem solving programming problems project projects prolog protocol protocols pseudo 3d python reddit redis reference release releases rendering research resource review rust searching secrets security series list server software sorting source code control statistics storage svg systemquery talks technical terminal textures thoughts three thing game three.js tool tutorial twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 worldeditadditions xmpp xslt

The legend of the disappearing data in Node.js

Happy leap day! :D

A green tree frog :D

_(Above: A nice green tree frog - source)_

Recently, I've been doing a bunch of work in Node.js streaming large amounts of data. For the most part the experience has been highly pleasurable, as Node.js makes it so easy! I have encountered a few pain points though, the most significant of which I'd like to talk about here.

In Node.js, streams come in 3 main forms:

  • Readable Streams
  • Writable Streams
  • Transform Streams

In addition, you can either plug streams together with the .pipe() method, write to them directly with the .write() method, or any combination thereof - allowing you to build up a chain of streams that enables data to flow through your program.

The problems start when you try and write large amounts of data to a stream directly:

import fs from 'fs';

import do_work from 'somewhere';
import get_some_stream from 'somewhere_else';

let stream_in = get_some_stream();
let out = fs.createWriteStream("/tmp/test.txt");
for(let i = 0; i < 1000000; i++) {
    out.write(do_work(stream_in, i))
}

(Above: Just an example of writing lots of data to a stream)

When this happens, you start to lose random chunks of data. The reason for this is not obvious, but it is buried in the Node.js docs:

The writable.write() method writes some data to the stream, and calls the supplied callback once the data has been fully handled. If an error occurs, the callback may or may not be called with the error as its first argument. To reliably detect write errors, add a listener for the 'error' event.

This is a huge pain. It means that you have to wrap all write calls like this:

"use strict";

/**
 * Writes data to a stream, automatically waiting for the drain event if asked.
 * @param   {stream.Writable}           stream_out  The writable stream to write to.
 * @param   {string|Buffer|Uint8Array}  data        The data to write.
 * @return  {Promise}   A promise that resolves when writing is complete.
 */
function write_safe(stream_out, data) {
    return new Promise((resolve, reject) => {
        // Handle errors
        let handler_error = (error) => {
            stream_out.off("error", handler_error);
            reject(error);
        };
        stream_out.on("error", handler_error);

        if(stream_out.write(data)) {
            // We're good to go
            stream_out.off("error", handler_error);
            resolve();
        }
        else {
            // We need to wait for the drain event before continuing
            stream_out.once("drain", () => {
                stream_out.off("error", handler_error);
                resolve();
            });
        }
    });
}

export { write_safe };

Such a huge boilerplate for such a simple task! Basically, if the .write() method returns false, you have to wait until the drain event is fired on the writeable stream before continuing to write to the stream. The reason for this I think is that it signals that the write buffer is full, and it needs to be drained before writing can continue.

This is ok, but it would be nice if this was abstracted away behind a single method, such as the wrapper I've shown above. Something like a async stream.Writable.writeAsync() would be great, but it doesn't currently exist.

I think I'm going to open an issue about it - since it seems very doable and just silly that it doesn't exist already.

Rust Review Redux

It was aaaages ago that I first reviewed Rust. For those not in the know, Rust is a next-generation compiled language (similar to Go, but this is where they diverge) developed by Mozilla - out of a need to have a safer alternative to C++ for writing key components of Firefox in.

Since then, I've obtained both a degree and a masters in computer science. I've also learnt a number of programming languages since then. I have been searching for a better alternative to C++ that's easier to use and doesn't fight you at every step - and I decided to give Rust another go.

After a few false starts, I managed to get going with starting to build a little web app (which will probably take a while until I can really show it off here). The tooling for the compiler is pretty good once you actually get it installed - although the installer itself is truly shocking (ly bad):

  • rustup - Manages multiple versions of Rust installed (I haven't used it much yet; apparently it's like nvm the Node Version Manager, but I don't use that either)
  • cargo - Orchestrates the building of your project and the installation of dependencies, which are known as crates.
  • rustc - The compiler itself. You probably won't interact with it directly much - instead going through cargo most of the time.

Together (and with the right Atom packages installed), they make for a relatively pleasant development experience. I mention the installer in particular though, because it's awful. I noted a number of issues with it:

  • The official website forces you to download an installation script that pipes to sh
  • It will only install on a per-user basis (goodbye disk space, hello extra system config complexity)
  • It doesn't even tell you how much disk space it's going to use (which wouldn't be an issue if they just setup an apt repository....)

These issues aside, other aspects of the experience were also worthy of note. First, the error messages the Rust compiler generates are actually useful. Much better than they were the last time I really dove into Rust, they provide you with much moree detail as to what's gone wrong, and there's even a special rustc --explain ERROR_CODE command you can execute to get more detail about what went wrong, why, and how to fix it.

This as a feature is certainly helpful for me as a beginner Rust programmer, but I think it's also a pretty essential feature given Rust's weirdness as a language.

I'm seriously not kidding - Rust is a nutty language. For one, classes exist.... sort of - but only as structs. Which are passed by reference (again, sort of) by default and may not contain methods - that's the job of an impl, which is short for an implementation. Implementations are a strange mix between C♯'s interfaces and multiple inheritance (in C++ I think it is?). And there are traits, which I haven't really looked into fully yet, but are a mix between interfaces and abstract classes..... you get the picture.

Point is, all this funky strangeness that goes on in Rust makes it a very challenging language to learn. A challenge that I feel is worth persevering with, but a challenge nonetheless. Rust does have a number of very powerful features that make it worth the effort, in my opinion.

For example, it catches entire classes of critically nasty bugs that plague other low-level systems languages such as C and C++ like use-after-free and the really awful concurrency race conditions at compile time - which is incredible, if you ask me. Such bugs have been a serious bother to many high-profile software projects that exist today and have caused a number of security issues. Rust is a testament to what can be achieved when you start from scratch and fix these issues by designing them out of the language.

For the curious, it does this by a complex system of variable lifetime, ownership, moving, and borrowing. I don't yet understand all the details, but the system enables the Rust compiler to be able to trace the lifetime of a variable at compile time, so you get the benefit of having a garbage collector without any of the overhead, since it's all been done at compile-time and built into your program that way.

This deep understanding of how data is passed around also yields performance and efficiency benefits too. C and C++ do not have such an understanding, so there are a number of performance optimisations the Rust compiler can make that would be considered far too dangerous for gcc to do. The net result of this is that sometimes code written in Rust will actually be faster than C and C++. This is a significant accomplishment, as the speed of C and C++ has been held as the gold standard for a long time (see exhibits A and B just for starters).

These are just some of the reasons that I'm persisting with learning Rust. So far, it seems like a "slow and steady wins the race" kinda deal - in that I'm taking it one concept at a time. There's a huge amount to take in, so I can't recommend that you try and do it all at once - time to consolidate what I've learnt so far is quite important I've found.

Rust is absolutely one of the hardest languages I've tried to learn, as it reinvents a lot of concepts which have been a staple of programming languages for a long time. However, it also comes with key benefits ease-of-use (once learnt, compared to C and C++), performance, and program execution safety at runtime (it was originally invented by Mozilla specifically to make Firefox a safer and faster browser, IIRC). To this end, I'm going to try my best to keep learning the language - and report back here at some point with cool stuff I've created (at the moment it's still in a state of flux and I'm refactoring heavily at each successive stage) :D

Edit: I've just remembered. I do currently have 2 big issues with rust: compilation time and disk space usage. When you install a dependency, it not only builds it from source,e but also recursively builds all of it's dependencies from source too. Not only does this take forever, but it also eats huge volumes of disk space for breakfast!

Found this interesting? Got some helpful advice or a question about Rust? Comment below!

PhD Update 2: The experiment, the data, and the supercomputers

Welcome to another PhD update post. Since last time, a bunch of different things have happened - which I'll talk about here. In particular, 2 distinct strands have become evident: The reading papers and theory bit - and the writing code and testing stuff out bit.

At the moment, I'm focusing much more heavily on the writing code and experimental side of things, as I've recently gained access to the 1km resolution rainfall radar dataset from CEDA. While I'm not allowed to share any data that I've now got, I'm pretty sure I'm safe to talk about how terribly it's formatted.

The data itself is with 1 directory per year, and 1 file per day inside those directories. Logical so far, right? Each of those files is a tar archive, inside which are the binary files that contain the data itself - with 1 file for every 5 minutes. These are compressed with gzip - which seems odd since they could probably gain greater compression ratios if they were tared first and compressed second (then the compression algorithm would be able to compress based on the similarities between files too).

The problems arise when you start to parse out the binary files themselves. They are in a propriety format - which has 3 different versions that don't conform to the (limited) documentation. To this end, it's been proving somewhat of a challenge to parse them and extract the bits I'm interested in.

To tackle this, I've been using Node.js and a bunch of libraries from npm (noisy pirate mutiny? nearest phase modulator? nasty popsicle machine? nah, it's probably node package manager):

  • binary-parser - For parsing the binary files themselves. Allows you to define the format of the file programmatically, and it'll parse it out into a nice object you can then manipulate.
  • gunzip-maybe - A streaming library that unzips a gzip-compressed stream
  • @icetee/ftp - An FTP client library for downloading the files (I know that FTP is insecure, that's all they offer at this time :-/)
  • tar-stream - For parsing tar files
  • nnng - Stands for No! Not National Grid!. helps with the conversion between OS national grid references and regular longitude latitude.

Aside from the binary file format, I encountered 3 main issues:

  1. The data is only a rectangle when using ordnance survey national grid references
  2. There's so much data, it needs to be streamed from the remote server
  3. Generating a valid gzip file is harder than you expect

Problem 1 here took me a while to figure out. Since as I mentioned the documentation is rather limited, I spent much longer than I would have liked attempting to parse the data in latitude longitude and finding it didn't work.

Problem 2 was rather interesting. Taking a cursory glance over the data before hand revealed that each daily tar file was about 80MiB - and with roughly 5.7K days worth of data (the dataset appears to go back to May 2004-ish), it quickly became clear that I couldn't just download them all and process them later.

It is for this reason that I chose Node.js in the first place for this. For those who haven't encountered it before, it's Javascript for the server - and it's brilliant for 2 main use-cases: networking and streaming data. Both of which were characteristics of the problem at hand - so the answer was obvious.

I'm still working on tweaking and improving my final solution, but as it stands after implementing the extractor on it's own, I've also implemented a wrapper that streams the tar archives from the FTP server, stream-reads the tar archives, streams the files in the tar archives into a gzip decompressor, parses the result, and then streams the interesting bits to disk as a disk object via a gzip compressor.

That's a lot of streams. The great part about this is that I don't accidentally end up with huge chunks of binary files in memory. The only bits that can't be streamed is the binary file parser and the bit that extracts the interesting bits.

I'm still working on the last issue, but I've been encountering nasty problems with the built-in zlib gzip compressor transformation stream. When I send a SIGINT (Ctrl + C) to the Node.js process, it doesn't seem to want to finish writing the gzip file correctly - leading to invalid gzip files with chunks missing from the end.

Since the zlib gzip transformation stream is so badly documented, I've ended up replacing it with a different solution that spawns a gzip child process instead (so you've got to have gzip installed on the machine you're running the script on, which shouldn't be a huge deal on Linux). This solution is better, but still requires some tweaks because it transpires that Node.js automatically propagates signals it receives to child processes - before you've had a chance to tie up all your loose ends. Frustrating.

Even so, I'm hopeful that I've pretty much got it to a workable state for now - though I'll need to implement a daemon-type script at some point to automatically download and process the new files as they are uploaded - it is a living dataset that's constantly being added to after all.

Papers

The other strand (that's less active at the minute) is reading papers. Last time, I mentioned the summary papers I'd read, and the direction I was considering reading in. Since then, I've both read a number of new papers and talked to a bunch of very talented people on-campus - so I've got a little bit of a better idea as to the direction I'm headed in now.

Firstly, I've looked into a few cutting-edge recurrent neural network types:

  • Grid LSTMs - Basically multi-dimensional LSTMs
  • Diluted LSTMs - Makes LSTMs less computationally intensive and better at learning long-term relationships
  • Transformer Neural Networks - more reading required here
  • NARX Networks

Many of these recurrent neural network structures appear to show promise for mapping floods. The last experiment into a basic LSTM didn't go too well (it turned out to be hugely computationally expensive), but learning from that experiment I've got lots of options for my next one.

A friend of mine managed to track down the paper behind Google's AI blog post - which turned out to be an interesting read. It transpires that despite the bold words in the blog post, the paper is more of an initial proposal for a research project - rather than a completed project itself. Most of the work they've done is actually using a traditional physics-based model - which they've basically thrown Google-scale compute power at to make pretty graphs - which they've then critically evaluated and identified a number of areas in which they can improve. They've been a bit light on details - which is probably because they either haven't started - or don't want to divulge their sekrets.

I also saw another interesting paper from Google entitled "Machine Learning for Precipitation Nowcasting from Radar Images", which I found because it was reported on by Ars Technica. It describes the short-term forecasting of rain from rainfall radar in the US (I'm in the UK) using a convolutional neural network-based model (specifically U-Net, apparently - I have yet to read up on it).

The model they use is comprised in part of convolutional neural network (CNN) layers that downsample 256x256 tiles to a smaller size, and then upscale it back to the original size. It has some extra connections that skip part of the model too. They claim that their model manages to improve on existing approaches for up to 6 hours in advance - so their network structure seems somewhat promising as inspiration for my own research.

Initial thoughts include theories as to whether I can use CNN layers like this to sandwich a more complex recurrent layer of some description that remembers the long-term relationships? I'll have to experiment.....

Found this interesting? Got a suggestion? Confused on a point? Comment below!

I've got an apt repository, and you can too

Hey there!

In this post, I want to talk about my apt repository. I've had it for a while, but since it's been working well for me I thought I'd announce it to wider world on here.

For those not in the know, an apt repository is a repository of software in a particular format that the apt package manager (found on Debian-based distributions such as Ubuntu) use to keep software on a machine up-to-date.

The apt package manager queries all repositories it has configured to find out what versions of which packages they have available, and then compares this with those locally installed. Any packages out of date then get upgraded, usually after prompting you to install the updates.

Linux distributions based on Debian come with a large repository of software, but it doesn't have everything. For this reason, extra repositories are often used to deliver updates to software automatically from third parties.

In my case, I've been finding increasingly that I've been wanting to deliver updates for software that isn't packaged for installation with apt to a number of different machines. Every time I get around to installing update it felt like it was time to install another, so naturally I got frustrated enough with it that I decided to automate my problems away by scripting my own apt repository!

My apt repository can be found here: https://starbeamrainbowlabs.com/

It comes in 2 parts. Firstly, there's the repository itself - which is managed by a script that's based on my lantern build engine. It's this I'll be talking about in this post.

Secondly, I have a number of as yet adhoc custom Laminar job scripts for automatically downloading various software projects from GitHub, such that all I have to do is run laminarc queue apt-softwarename and it'll automatically package the latest version and upload it to the repository itself, which has a cron job set to fold in all of the new packages at 2am every night. The specifics of this are best explain in another post.

Currently this process requires me to login and run the laminarc command manually, but I intend to automate this too in the future (I'm currently waiting for a new release of beehive to fix a nasty bug for this).

Anyway, currently I have the following software packaged in my repository:

  • Gossa - A simple HTTP file browser
  • The Tiled Map Editor - An amazing 2D tile-based graphical map editor. You should sponsor the developer via any of the means on the Tiled Map Editor's website before using my apt package.
  • tldr-missing-pages - A small utility script for finding tldr-pages to write
  • webhook - A flexible webhook system that calls binaries and shell scripts when a HTTP call is made
    • I've also got a pleaserun-based service file generator packaged for this too in the webhook-service package

Of course, more will be coming as and when I discover and start using cool software.

The repository itself is driven by a set of scripts. These scripts were inspired by a stack overflow post that I have since lost, but I made a number of usability improvements and rewrote it to use my lantern build engine as I described above. I call this improved script aptosaurus, because it sounds cool.

To use it, first clone the repository:

git clone https://git.starbeamrainbowlabs.com/sbrl/aptosaurus.git

Then, create a new GPG key to sign your packages with:

gpg --full-generate-key

Next, we need to export the new keypair to disk so that we can use it in scripts. Do that like this:

# Identify the key's ID in the list this prints out
gpg --list-secret-keys
# Export the secret key
gpg --export-secret-keys --armor INSERT_KEY_ID_HERE >secret.gpg
chmod 0600 secret.gpg # Don't forget to lock down the permissions
# Export the public key
gpg --export --armor INSERT_KEY_ID_HERE >public.gpg

Then, run the setup script:

./aptosaurus.sh setup

It should warn you if anything's amiss.

With the setup complete, you can new put your .deb packages in the sources subdirectory. Once done, run the update command to fold them into the repository:

./aptosaurus.sh update

Now you've got your own repository! Your next step is to setup a static web server to serve the repo subdirectory (which contains the repo itself) to the world! Personally, I use Nginx with the following config:

server {
    listen  80;
    listen  [::]:80;
    listen  443 ssl http2;
    listen  [::]:443 ssl http2;

    server_name apt.starbeamrainbowlabs.com;
    ssl_certificate     /etc/letsencrypt/live/$server_name/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/$server_name/privkey.pem;

    #add_header strict-transport-security "max-age=31536000;";
    add_header x-xss-protection "1; mode=block";
    add_header x-frame-options  "sameorigin";
    add_header link '<https://starbeamrainbowlabs.com$request_uri>; rel="canonical"';

    index   index.html;
    root    /srv/aptosaurus/repo;

    include /etc/nginx/snippets/letsencrypt.conf;

    autoindex   off;
    fancyindex  on;
    fancyindex_exact_size   off;
    fancyindex_header   header.html;

    #location ~ /.well-known {
    #   root    /srv/letsencrypt;
    #}

}

This requires the fancyindex module for Nginx, which can be installed with sudo apt install libnginx-mod-http-fancyindex on Ubuntu-based systems.

To add your new apt repository to a machine, simply follow the instructions for my repository, replacing the domain name and the key ids with yours.

Hopefully this release announcement-turned-guide has been either interesting, helpful, or both! Do let me know in the comments if you encounter any issues. If there's enough interest I'll migrate the code to GitHub from my personal Git server if people want to make contributions (express said interest in the comments below).

It's worth noting that this is only a very simply apt repository. Larger apt repositories are sectioned off into multiple categories by distribution and release status (e.g. the Ubuntu repositories have xenial, bionic, eoan, etc for the version number of Ubuntu, and main, universe, multiverse, restricted, etc for the different categories of software).

If you setup your own simple apt repository using this guide, I'd love it if you could let me know with a comment below too.

Switching TOTP providers from Authy to andOTP

Since I first started using 2-factor authentication with TOTP (Time based One Time Passwords), I've been using Authy to store my TOTP secrets. This has worked well for a number of years, but recently I decided that I wanted to change. This was for a number of reasons:

  1. I've acquired a large number of TOTP secrets for various websites and services, and I'd like a better way of sorting the list
  2. Most of the web services I have TOTP secrets for don't have an icon in Authy - and there are only so many times you can repeat the 6 generic colours before it becomes totally confusing
  3. I'd like the backups of my TOTP secrets to be completely self-hosted (i.e. completely on my own infrastructure)

After asking on Reddit, I received a recommendation to use andOTP (F-Droid, Google Play). After installing it, I realised that I needed to export my TOTP secrets from Authy first.

Unfortunately, it turns out that this isn't an easy process. Many guides tell you to alter the code behind the official Authy Chrome app - and since I don't have Chrome installed (I'm a Firefox user :D), that's not particularly helpful.

Thankfully, all is not lost. During my research I found the authy project on GitHub, which is a command-line app - written in Go - temporarily registers as a 'TOTP provider' with Authy and then exports all of your TOTP secrets to a standard text file of URIs.

These can then be imported into whatever TOTP-supporting authenticator app you like. Personally, I did this by generating QR codes for each URI and scanning them into my phone. The URIs generated, when converted to a QR code, are actually in the same format that they were originally when you scan them in the first place on the original website. This makes for an easy time importing them - at least from a walled garden.

Generating all those QR codes manually isn't much fun though, so I automated the process. This was pretty simple:

#!/usr/bin/env bash
exec 3<&0; # Copy stdin
while read url; do
    echo "${url}" | qr --error-correction=H;
    read -p "Press a enter to continue" <&3; # Pipe in stdin, since we override it with the read loop
done <secrets.txt;

The exec 3<&0 bit copies the standard input to file descriptor 3 for later. Then we enter a while loop, and read in the file that contains the secrets and iterate over it.

For each line, we convert it to a QR code that displays in the terminal with VT-100 ANSI escape codes with the Python program qr.

Finally, after generating each QR code we pause for a moment until we press the enter key, so that we can generate the QR codes 1 at a time. We pipe in file descriptor 3 here that we copied earlier, because inside the while loop the standard input is the file we're reading line-by-line and not the keyboard input.

With my secrets migrated, I set to work changing the labels, images, and tags for each of them. I'm impressed by the number of different icons it supports - and since it's open-source if there's one I really want that it doesn't have, I'm sure I can open a PR to add it. It also encrypts the TOTP secrets database at rest on disk, which is pretty great.

Lastly came the backups. It looks like andOTP is pretty flexible when it comes to backups - supporting plain text files as well as various forms of encrypted file. I opted for the latter, with GPG encryption instead of a password or PIN. I'm sure it'll come back to bite me later when I struggle to decrypt the database in an emergency because I find the gpg CLI terribly difficult to use - perhaps I should take multiple backups encrypted with long and difficult password too.

To encrypt the backups with GPG, you need to have a GPG provider installed on your phone. It recommended that I install OpenKeychain for managing my GPG private keys on Android, which I did. So far, it seems to be functioning as expected too - additionally providing me with a mechanism by which I can encrypt and decrypt files easily and perform other GPG-related tasks...... if only it was this easy in the Linux terminal!

Once setup, I saved my encrypted backups directly to my Nextcloud instance, since it turns out that in Android 10 (or maybe before? I'm not sure) it appears that if you have the Nextcloud app installed it appears as a file system provider when saving things. I'm certainly not complaining!

While I'm still experimenting with my new setup, I'm pretty happy with it at the moment. I'm still considering how I can make my TOTP backups even more secure while not compromising the '2nd factor' nature of the thing, so it's possible I might post again in the future about that.

Next on my security / privacy todo list is to configure my Keepass database to use my Solo for authentication, and possibly figure out how I can get my phone to pretend to be a keyboard to input passwords into machines I don't have my password database configured on :D

Found this interesting? Got a suggestion? Comment below!

Cluster, Part 1: Answers only lead to more questions

At home, I have a Raspberry Pi 3B+ as a home file server. Lately though, I've been noticing that I've been starting to grow out of it (both in terms of compute capacity and storage) - so I've decided to get thinking early about what I can do about it.

I thought of 2 different options pretty quickly:

  • Build a 'proper' server
  • Build a cluster instead

While both of these options are perfectly viable and would serve my needs well, one of them is distinctly more interesting than the other - that being a cluster. While having a 'proper' server would be much simpler, perhaps slightly more power efficient (though I would need tests to confirm, since ARM - the CPU architecture I'm planning on using for the cluster - is more power efficient), and would put all my system resources on the same box, I like the idea of building a cluster for a number of reasons.

For one, I'll learn new skills setting it up and managing it. So far, I've been mostly managing my servers by hand. When you start to acquire a number of machines though, this quickly becomes unwieldy. I'd like to experiment with a containerisation technology (I'm not sure which one yet) and play around with distributing containers across hosts - and auto-restarting them on a different host if 1 host goes down. If this is decentralised, even better!

For another, having a single larger server is a single point of failure - which would be relatively expensive to replace. If I use lots of small machines instead, then if 1 dies then not only is it cheaper to replace, but it's also not as urgent since the other machines in the cluster can take over while I order a replacement.

Finally, having a cluster is just cool. Do we really need more of a reason than this?

With all this in mind, I've been thinking quite a bit about the architecture of such a cluster. I haven't bought anything yet (and probably won't for a while yet) - because as you may have guessed from the title of this post I've been running into a number of issues that all need researching.

First though let's talk about which machines I'm planning on using. I'm actually considering 2 clusters, to solve 2 different issues: compute and storage. Compute refers to running applications (e.g. Nextcloud etc), and storage refers to a distributed storage mechanism with multiple hosts - each with 1 drive attached - though I'm unsure about the storage cluster at this stage.

For the compute cluster, I'm leaning towards 4 x Raspberry Pi 4 with 4GiB of RAM each. For the storage cluster, I'm considering a number of different boards. 3 identical boards of 1 of the following:

I do seem to remember a board that had USB 3 onboard, which would be useful for connecting to the external drives. Currently the plan is to use a SATA to USB converter connect to internal HDDs (e.g. WD Greens) - but I have yet to find one that doesn't include the power connector or splits the power off into a separate USB cable (more on power later). This would be all be backed up by a Gigabit switch of some description (so the Rock Pi S is not a particularly attractive option, since it would be limited to 100MiB).

I've been using HackerBoards.com to discover different boards which may fit my project - but I'm not particularly satisfied with any of the options here so far. Specifically, I'm after Gigabit Ethernet and USB 3 on the same board if possible.

The next issue is software support. I've been bitten by this before, so I'm being extra cautious this time. I'm after a board that provides good software support, so that I can actually use all the hardware I've paid for.

The other thing relating to software that I'd like if possible is the ability to use a systemd-free operating system. Just like before, when I selected Manjaro OpenRC (which is now called Artix Linux), since I already have a number of systems using systemd I would like to balance it out a bit with some systems that use something else. I don't really mind if this is OpenRC, S6, or RunIt - just that it's something different to broaden my skill set.

Unfortunately, it's been a challenge to locate a distribution of Linux that both has broad support for ARM SoCs and does not use systemd. I suspect that I may have to give up on this, but I'm still holding out hope that there's a distribution out there that can do what I want - even if I have to prepare the system image myself (Alpine Linux looks potentially promising, but at the moment it's a huge challenge to figure out whether a chipset supported or not....). Either way, from my research it looks like having mainline Linux kernel support fro my board of choice is critically important to ensure continued support and updates (both feature and security) in the future.

Lastly, I also have power problems. Specifically, how to power the cluster. The big problem is that the Raspberry Pi 4 requires 3A of power max - instead the usual 2.4A in the 3B+ model. Of course, it won't be using this all the time, but it's apparently important that the ceiling of the power supply is 3A to avoid issues. Problem is, most multi-port chargers can barely provide 2A to connected devices - and I have not yet seen one that would provide 3A to 4+ devices and support additional peripherals such as hard drives and other supporting boards as described above.

To this end, I may end up having to build my own power supply from an old ATX supply that you can find in an old desktop PC. These can generally supply plenty of power (though it's always best to check) - but the problem here is that I'd need to do a ton of research to make sure that I wire it up correctly and safely, to avoid issues there too (I'm scared of blowing a fuse or electrocuting someone etc).

This concludes my first blog post on my cluster plans. It may be a while until the next one, as I have lots more research to do before I can continue. Suggestions and tips are welcomed in the comments below.

Demystificating VPNs

After seeing yet another article that misunderstands and misrepresents VPNs, I just hda to make a post about it. This post actually started life as a reddit comment, but I decided to expand on it and make it a full post here on my blog.

VPNs are a technology that simply sends your traffic through an encrypted tunnel that pops out somewhere else.

For the curious, they do this by creating what's known as a virtual tunnel interface on your computer (on Linux-based machines this is often tun0) and alter your machine's routing table to funnel all your network traffic destined for the Internet into the tunnel interface.

The tunnel interface actually encrypts your data and streams it to a VPN server (though there are exceptions) that then forwards it on to the wider Internet for you.

This is great if:

  • You live in a country that censors your Internet connection
  • You don't trust your ISP
  • You are connected to a public open WiFi hotspot
  • You need to access resources on a remote network that only allow those physically present to use them

By using a VPN, you can make your device appear as though it is somewhere else. You can also hide your Internet traffic from the rest of your network that you are connected to.

However, VPNs are not a magic bullet. They are not so great at:

  • Blocking trackers
  • Blocking Ads
  • Blocking mining scripts that suck up your CPU
  • Limiting the amount of your data online services get a hold of

This is because it simply makes you appear as though you are somewhere else - it doesn't block or alter any of the traffic coming and going from your device - online services can still see all your personal data - all that's changed when you use a VPN is that your data is going via a waypoint on it's journey to it's final destination.

All hope is not lost, however - for there are steps you can take to deal with these issues. Try these steps instead:

You may already be aware of these points - but in particular multi-account containers are quite interesting. By using an extension like the one I link to above, you can in effect have multiple browsers open at the same time. By this, I mean that you can have multiple 'sandboxes' - and the site data (e.g. cookies etc) will not cross over from 1 sandbox to another.

This gives websites the illusion of being loaded in multiple different environments - with limited options to figure out that they are in fact on the same machine - especially when combined with other measures.

Hopefully this clears up some of the confusion. If you know anyone else who's confused about VPNs, please share a link to this post with them! The fewer people who get the wrong idea about VPNs, the better.

Found this interesting? Have another privacy related question? Found an error in this post? Comment below!

Own Your Code Series List

Hey there! It's time for another series list. This time it's for my Own Your Code series, where I take a look into Gitea and Laminar CI.

Following this series, I plan to also post about my apt repository, which is hosting a growing list of software - including the tiled map editor (support them with a donation if you can), gossa (a minimalist file browser interface), and webhook - if you find any issues, you can always get in touch.

Anyway, here's the full list of posts in the series in the Own Your Code series:

In the unlikely event I post another entry in this series, I'll come back and update this list. Most likely though I'll be posting related things standalone, rather than part of this series - so subscribe for updates with your favourite method if you'd like to stay up-to-date with my latest blog posts (Atom/RSS, Email, Twitter, Reddit, and Facebook are all supported - just ask if there's something missing).

Why the TICK stack probably isn't for me

Recently, I've been experimenting with upgrading my monitoring system. The TICK stack consists of a number of different elements:

Together, these 4 programs provide everything you need to monitor your infrastructure, generate graphs, and send alerts via many different channels when things go wrong. This works reasonably well - and to give the developers credit, the level of integration present is pretty awesome. Telegraf seamlessly inserts metrics into InfluxDB, and Chronograf is designed to integrate with the metrics generated by Telegraf.

I haven't tried Kapacitor much yet, but it has an impressive list of integrations. For reference, I've been testing the TICK stack on an old Raspberry Pi 2 that I had lying around. I also tried Grafana too, which I'll talk about later.

The problems start when we talk about the system I've been using up until now (and am continuing to use). I've got a Collectd setup going - with Collectd Graph Panel (CGP) as a web interface, which is backed by RRD databases.

CGP, while it has it's flaws, is pretty cool. Unlike Chronograf, it doesn't require manual configuration when you enable new metric types - it generates graphs automatically. For a small personal home network, I don't really want to be spending hours manually specifying what all the graphs should look like for all the metrics I'm collecting. It's seriously helpful to have it done automatically.

Grafana also stumbles here. Before I installed the CK part of the TICK stack, I tried Grafana. After some initial installation issues (the Raspberry Pi 2's CPU only supports up to ARMv6, and Grafana uses ARMv7l instructions, causing some awkward and unfortunate issues that were somewhat difficult to track down). While it has an incredible array of different graphs and visualisations you can configure, like Chronograf it doesn't generate any of these graphs for you automatically.

Both solutions do have an import / export system for dashboards, which allows you to share prebuilt dashboards - but this isn't the same as automatic graph generation.

The other issue with the TICK stack is how heavy it is. Spoiler: it's very heavy indeed - especially InfluxDB. It managed to max out my poor Raspberry Pi 2's CPU - and ate all my RAM too! It look quite a bit of tuning to configure it such that it didn't eat all of my RAM for breakfast and knock my SSH session offline.

I'm sure that in a business setting you'd have heaps of resources just waiting to be dedicated to monitoring everything from your mission-critical servers to your cat's lunch - but in a home setting it takes up more resources passively when it isn't even doing anything than everything else I'm monitoring..... combined!

It's for these reasons that I'm probably not going to end up using the TICK (or TIG, for that matter) stack. For the reasons I've explained above, while it's great - it's just not for me. What I'm going to use instead though, I'm not sure. Development on CGP ceased in 2017 (or probably before that) - and I've got a growing list of features I'd like to add to it - including (but not limited to) fixing the SMART metrics display, reconfiguring the length of time metrics are stored for, and fixing a super annoying bug that makes the graphs go nuts when you scroll on them on a touchpad with precise scrolling enabled.

Got a suggestion for another different system I could try? Comment below!

Happy Christmas 2019!

I've been otherwise occupied enjoying my Christmas holiday this year - and I hope you have been too. Have a picture of a bauble:

Art by Mythdael