Stardust | Starbeamrainbowlabs

A snapshot into my PhD: Rainfall radar model debugging

Hello again!

The weather is cold over here right now, and it's also been a while since I posted about my PhD in some detail, so I thought while I get my thoughts in order for a new PhD update blog post I'd give you a snapshot into what I've been doing in the last 2 weeks.

If you're not interested in nitty gritty details, I'll be posting a higher-level summary soon in my next PhD update blog post.

For context, since wrapping up (for now, more on this in the PhD update blog post) the social media side of my PhD, I've returned to the rainfall radar half of my PhD and have been implementing and debugging several AI models to predict water depth in real time. If you're thinking of doing a PhD yourself, this is in no way representative of what a PhD is like! Each PhD is different - mine just happens to include lots of ~~banging my head against a wall~~ debugging.

To start with, recently I've found and fixed a nasty bug in the thresholding function, which defaulted to a value of 1.5 instead of 0.1. My data is stored in .tfrecord files with pairs of rainfall radar and water depth 'images'. When the model reads these in, it will 'threshold' the water depth: for each pixel setting a value of 0 for pixels with water depth lower than a given threshold, and 1 for pixels above.
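To make the thresholding step concrete, here's a minimal sketch in plain Python (not the actual Tensorflow code from the model). The function name threshold_depth and the example depth values are made up for illustration, and treating a depth exactly equal to the threshold as 1 is an assumption on my part.

```python
def threshold_depth(depth_grid, threshold=0.1):
    """Binarise a 2D grid of water depths (in metres).

    Pixels below the threshold become 0, everything else becomes 1.
    """
    return [[1 if depth >= threshold else 0 for depth in row] for row in depth_grid]


# Hypothetical 3x3 water depth 'image', in metres
depths = [
    [0.00, 0.05, 0.20],
    [0.12, 0.00, 1.60],
    [0.09, 0.30, 0.00],
]

intended = threshold_depth(depths, threshold=0.1)
buggy = threshold_depth(depths, threshold=1.5)

print(intended)  # [[0, 0, 1], [1, 0, 1], [0, 1, 0]]
print(buggy)     # [[0, 0, 0], [0, 0, 1], [0, 0, 0]]
```

With a threshold of 1.5, nearly every pixel ends up as 0, so a model that predicts 0 everywhere would look almost perfect.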

The bug in question manifested itself as an accuracy of 99%/100%, which is extremely unlikely given the nature of the task I'm asking it to predict. After some extensive debugging (including implementing a custom loss function that wrapped several different other loss functions, though not all at the same time), I found that the default value for the threshold was 1.5 (metres) instead of what it should have been - 0.1 (again, metres).

After fixing this, the accuracy lowered to 83% - the proportion of pixels in the input that were not water.

The model in question predicts water depth in 2D, taking in rainfall radar data (also in 2D) as an input. It uses ConvNeXt as an encoder, and an inverted ConvNeXt as the decoder. For the truly curious, the structure of this model as of the end of this section of the post can be found in the summary.txt file here.

Although I'd fixed the bug, I still had a long way to go. An accuracy of 83% is, in my case, no better than random guessing: the model was simply predicting 'not water' for every pixel, unfortunately completely ignoring the minority class.
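As a rough illustration of why 83% is not impressive here, the sketch below scores a 'model' that always predicts 'not water'. The 83/17 split mirrors the proportion mentioned above, but the label list itself and the helper names (accuracy, recall) are hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of pixels where the prediction matches the label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def recall(predicted, actual, positive=1):
    """Fraction of the positive-class pixels that were actually found."""
    found = sum(1 for p, a in zip(predicted, actual) if a == positive and p == positive)
    total = sum(1 for a in actual if a == positive)
    return found / total


# Hypothetical flattened labels: 83 'not water' pixels (0) and 17 'water' pixels (1)
labels = [0] * 83 + [1] * 17
always_dry = [0] * len(labels)  # a 'model' that ignores the minority class

print(accuracy(always_dry, labels))  # 0.83
print(recall(always_dry, labels))    # 0.0
```

The accuracy looks respectable, but the recall for the water class shows that nothing useful has been learnt.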

In an attempt to get it to stop ignoring the minority class, I tried (in no particular order):

Increasing the learning rate to 0.1
Summing the output of the loss function instead of doing tf.math.reduce_sum(loss_output) / batch_size, as is the default
Multiplying the rainfall radar values by 100 (standard deviation of the rainfall radar data is in the order of magnitude of 0.01, mean 0.0035892173182219267, min 0, max 1)
Adding an extra input channel with the heightmap (cross-entropy, loss: 0.583, training accuracy: 0.719)
Using the dice loss function instead of one-hot categorical cross-entropy
Removing the activation function (GeLU in my case) from the last few layers of the model (helpful, but I tried this with dice and not cross entropy loss, and switching back would be a bit of a pain)

Unfortunately, none of these things fixed the underlying issue of the model not learning anything.
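For the loss reduction item in the list above, the difference between the two approaches is just a constant factor. A tiny sketch with made-up per-example loss values (plain Python rather than Tensorflow):

```python
# Hypothetical per-example losses for a batch of 4
per_example_loss = [0.7, 0.6, 0.8, 0.5]
batch_size = len(per_example_loss)

# Default style: sum of the losses divided by the batch size
loss_mean = sum(per_example_loss) / batch_size

# Variant tried above: just sum the losses
loss_sum = sum(per_example_loss)

print(loss_mean, loss_sum)
```

The summed loss is batch_size times larger than the averaged one. With plain gradient descent that behaves like scaling the learning rate by the batch size; adaptive optimisers such as Adam may respond differently.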

Dice loss was an interesting case - I have some Cool Graphs to show on this:

Here, I compare removing the activation function (GeLU) from the last few layers (ref) with not removing the activation function from the last few layers. Clearly, removing it helps significantly, as the loss actually has a tendency to go in a downward direction instead of rocketing sky high.

This shows the accuracy and validation accuracy for the model without the activation function in the last few layers. Unfortunately, the Dice loss function has some way to go before it can compete with cross-entropy. I speculate that while dice is cool, it isn't as useful on its own in this scenario.

It would be cool to compare having no activation function in the last few layers and using cross-entropy loss to my previous attempts, but I'm unsure if I'll have time to noodle around with that.

In terms of where I got the idea to use the dice loss function from, it's from this GitHub repo and its associated paper: https://github.com/shruti-jadon/Semantic-Segmentation-Loss-Functions.

It has a nice summary of loss functions for image segmentation and their uses / effects. If/when DeepLabV3+ actually works (see below) and I have some time, I might return to this to see if I can extract a few more percentage points of accuracy from whatever model I end up with.
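For reference, here is one common formulation of the dice loss, sketched in plain Python for a flattened binary mask. This is not necessarily the exact variant used in my model or in the repository above; the smooth term and the example values are assumptions for illustration.

```python
def dice_loss(predicted, target, smooth=1.0):
    """1 - Dice coefficient for flattened predictions and labels in [0, 1]."""
    intersection = sum(p * t for p, t in zip(predicted, target))
    denominator = sum(predicted) + sum(target)
    dice = (2.0 * intersection + smooth) / (denominator + smooth)
    return 1.0 - dice


# Hypothetical flattened labels: 6 'not water' pixels, 2 'water' pixels
target = [0, 0, 0, 0, 0, 0, 1, 1]
perfect = [float(t) for t in target]
all_dry = [0.0] * len(target)

print(dice_loss(perfect, target))  # 0.0
print(dice_loss(all_dry, target))  # roughly 0.667
```

Unlike plain accuracy, predicting 'not water' everywhere is penalised heavily here, because the score is driven by the overlap with the minority class.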

DeepLabV3+

Simultaneously with the above, I've been reading into existing image segmentation models. Up until now, my hypothesis has been that a model well connected with skip connections, such as U-Net, would not be ideal in this situation: the input (rainfall radar) is so drastically different from the output (water depth), and skip connections encourage the output to be more similar to the input, which is not really what I want.

Now, however, I am (finally) going to test this (long running) hypothesis to see if it's really true. To do this, I needed to find the existing state-of-the-art image segmentation model. To summarise long hours of reading, I found the following models:

SegNet (bad)
FCN (bog standard, also bad, maybe this paper)
U-Net (heard of this before)
PSPNet (like a pyramid structure, was state of the art but got beaten recently)
DeepLabV3 (PSPNet but not quite as good)
DeepLabV3+ (terrible name to search for but the current state of the art, beats PSPNet)

To this end, I've found myself a DeepLabV3+ implementation on keras.io (the code is terrible and so full of spaghetti I could eat it for breakfast and still have some left over) and I've tested it with the provided dataset, which seems to work fine:

....though there seems to be a bug in the graph plotting code, in that it doesn't clear the last line plotted.

Not sure if I can share the actual segmentations that it produces, but I can say that while they are a bit rough around the edges, it seems to work fine.

The quality of the segmentation is somewhat lacking given the training data only consisted of ~1k images and it was only trained for ~25 epochs. It has ~11 million parameters or so. I'm confident that more epochs and more data would improve things, which is good enough for me, so my next immediate task will be to push my own data through it and see what happens.

I hypothesise that models like DeepLabV3+ etc also bring a crucial benefit to the table: training stability. Given the model's greater interconnectedness with skip connections etc, backpropagation overall needs to travel less far to cover the entire model and update the weights.

If you're interested, a full summary of this DeepLabV3+ model can be found here: https://starbeamrainbowlabs.com/blog/images/20221215-DeeplabV3+_summary.txt

If it does work, I've recently implemented a cool attention mechanism called Convolutional Block Attention Module (CBAM), which looks seriously cool. I'd like to try adding it to the DeepLabV3+ model to see if it increases the accuracy of the output.

Finally, a backup plan is in order in case it doesn't work. My plan is to convolve over the input rainfall radar data and make a prediction for a single water depth pixel at a time, using ConvNeXt as an image encoder backbone (though I may do tests with other backbones such as its older cousin ResNet simultaneously just in case, see also my post on image encoders), keeping the current structure of 7 channels rainfall radar + 1 channel heightmap.

While this wouldn't be ideal (given you'd need to push through multiple batches just to get a single 2D prediction), the model to make such predictions would be simpler and more likely to work right off the bat.
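To give an idea of what predicting a single pixel at a time looks like, here's a minimal sketch in plain Python. Everything here is hypothetical: toy_predictor is a trivial stand-in for a real image encoder, the window radius and zero padding are arbitrary choices, and there is only a single input channel rather than 7 + 1.

```python
def extract_patch(grid, row, col, radius, pad_value=0.0):
    """Return the square window centred on (row, col), padding outside the grid."""
    height, width = len(grid), len(grid[0])
    patch = []
    for r in range(row - radius, row + radius + 1):
        patch_row = []
        for c in range(col - radius, col + radius + 1):
            inside = 0 <= r < height and 0 <= c < width
            patch_row.append(grid[r][c] if inside else pad_value)
        patch.append(patch_row)
    return patch


def predict_grid(grid, predict_pixel, radius=1):
    """Build a full 2D prediction by calling predict_pixel once per output pixel."""
    return [
        [predict_pixel(extract_patch(grid, r, c, radius)) for c in range(len(grid[0]))]
        for r in range(len(grid))
    ]


def toy_predictor(patch):
    """Stand-in for a real model: 1 if the mean value in the window exceeds 0.3."""
    values = [value for row in patch for value in row]
    return 1 if sum(values) / len(values) > 0.3 else 0


# Hypothetical single-channel rainfall radar 'image'
radar = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
]
prediction = predict_grid(radar, toy_predictor, radius=1)
print(prediction)  # [[0, 0, 0], [0, 1, 1], [0, 1, 1]]
```

Note that the predictor is called once for every output pixel, which is why a full 2D prediction would take multiple batches.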

Conclusion

I've talked a bunch about my process and thoughts on debugging my rainfall radar to water depth model and trying to get it to work. Taking a single approach at a time to problems like this isn't usually the best idea, so I'm also trying something completely new in DeepLabV3+ to see if it will work.

I also have a backup plan in a more traditional image encoder-style model that will predict a single pixel at a time. As I mentioned at the beginning of this blog post, every PhD is different, so this is not representative of what you'd be doing on yours if you decide to do one / are doing one / have done one. If you are thinking of doing a PhD, please do get in touch if you're interested in hearing more about my experiences doing one and what you could expect.

I'm on Mastodon/Fediverse!

An ocean at sunset, vector done by me in Inkscape.

Hello there once again!

This is just a quick post to let you know that I can now be found on the fediverse! My account name is @sbrl@noc.social (update September 2023: I'm now @sbrl@fediscience.org). It's on a public Mastodon instance for now, but it may change in the future (I'll update this post if/when I change it).

At the moment, posts are copied from Twitter automatically by this neat tool, which means that automated tweets about new blog posts and Pepperminty Wiki releases, and other tweets I make, will also be copied over to my fediverse account.

In the future, I would like to use my new n8n instance that's now running on my cluster to automate blog post notifications instead.

While I'm on the topic, I also now have a Discord server you can join, which also has an announcements channel where blog posts are posted by the above n8n instance I have set up. It also has a channel or two for chatting, which I'll expand and change as necessary. You can join it with this link:

https://discord.gg/aQd6yDNcGV

Finally, as more of a maintenance thing, the automated posts to the subreddit have also been switched over to n8n from their previous home in IFTTT. The (eventual) plan is to switch everything over to my self-hosted n8n instance.

If there are any other social media networks you'd like me to post blog post updates etc to that are not currently listed on my homepage, please leave a comment below and I'll take a look to see how feasible it is.

In terms of new blog posts coming, I have one on memtest86+ that's almost finished, but I just need to disentangle memtest86+ and figure out where to download a stable version that supports UEFI from. It's also about time I write another PhD update blog post. And, of course, there are also the random posts that spring out of what I'm currently working on, which recently has been about my PhD AI development (hit a really nasty snag recently which was fun to solve and debug; if I can adapt it into a blog post I'll post it here).

If there's anything you'd like me to blog about, please let me know in a comment below.

--Starbeamrainbowlabs

Edit September 2023: My account is now @sbrl@fediscience.org, and unfortunately the automated cross-poster has broken :-( The fediverse is now my primary social media account.

AI encoders demystified

When beginning the process of implementing a new deep learning / AI model, alongside deciding on the inputs and outputs of the model and converting your data to tfrecord files (file extension: .tfrecord.gz; you really should, the performance boost is incredible), you also need to design the actual architecture of the model itself. It might be tempting to shove a bunch of CNN layers at the problem to make it go away, but in this post I hope to convince you to think again.

As with any field, extensive research has already been conducted, and when it comes to AI that means that the effective encoding of various different types of data with deep learning models has already been carefully studied.

To my understanding, there are 2 broad categories of data that you're likely to encounter:

Images: photos, 2D maps of sensor readings, basically anything that could be plotted as an image
Sequenced data: text, sensor readings with only a temporal and no spatial dimension, audio, etc

Sometimes you'll encounter a combination of these two. That's beyond the scope of this blog post, but my general advice is:

If it's 3D, consider it a 2D image with multiple channels to save VRAM
Go look up specialised model architectures for your problem and how well they work/didn't work
Use a Convolutional variant of a sequenced model or something
Pick one of these encoders and modify it slightly to fit your use case
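To illustrate the first tip in the list above, here's a minimal sketch of turning a stack of 2D frames into a single 2D image with multiple channels. It uses nested Python lists to stay dependency-free; the function name timesteps_to_channels and the example data are hypothetical, and in practice you'd do this with your tensor library's transpose operation.

```python
def timesteps_to_channels(frames):
    """Convert [time][height][width] nested lists into [height][width][channels]."""
    steps = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [[frames[t][r][c] for t in range(steps)] for c in range(width)]
        for r in range(height)
    ]


# Hypothetical data: 3 time steps of a 2x2 sensor reading
frames = [
    [[1, 2], [3, 4]],      # time step 0
    [[5, 6], [7, 8]],      # time step 1
    [[9, 10], [11, 12]],   # time step 2
]
image = timesteps_to_channels(frames)

print(image[0][0])  # [1, 5, 9]
print(image[1][1])  # [4, 8, 12]
```

Each pixel now carries one channel per time step, so an ordinary 2D image encoder can consume it.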

In this blog post, I'm going to give a quick overview of the various state of the art encoders for the 2 categories of data I've found on my travels so far. Hopefully, this should give you a good starting point for building your own model. By picking an encoder from this list as a starting point for your model and reframing your problem just a bit to fit, you should find that you have not only a much simpler problem/solution, but also a more effective one too.

I'll provide links to the original papers for each model, but also links to any materials I've found helpful in understanding how they work. This can be useful if you find yourself in the awkward situation of needing to implement them yourself, but it's also generally good to have an understanding of the encoder you ultimately pick.

I've read a number of papers, but there's always the possibility that I've missed something. If you think this is the case, please comment below. This is especially likely if you are dealing with audio data, as I haven't looked into handling that too much yet.

Finally, also out of scope of this blog post are the various ways of framing a problem for an AI to understand it better. If this interests you, please comment below and I'll write a separate post on that.

Images / spatial data

ConvNeXt: Image encoders are a shining example of how a simple stack of CNNs can be iteratively improved through extensive testing, and in no encoder is this more apparent than in ConvNeXt.

In 1 sentence, a ConvNeXt model can be summarised as "a ResNet but with the features of a transformer". Researchers took a bog-standard ResNet, and then iteratively tested and improved it by adding various features that you'd normally find in a transformer, which results in the most performant (read: had the highest accuracy) encoder when compared to the other 2 on this list.

ConvNeXt is proof that CNN-based models can be significantly improved by incorporating more than just a simple stack of CNN layers.

Paper: A ConvNet for the 2020s
Explanation: ConvNext: The Return Of Convolution Networks
Implementation: https://github.com/leanderme/ConvNeXt-Tensorflow

Vision Transformer: Born from when someone asked what should happen if they tried to put an image into a normal transformer (see below), Vision Transformers are a variant of a normal transformer that handles images and other spatial data instead of sequenced data (e.g. sentences).

The current state of the art is the Swin Transformer to the best of my knowledge. Note that ConvNeXt outperforms Vision Transformers, but the specifics of course depend on your task.

Paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Explanation: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows - Committed towards better future
Implementation: https://github.com/microsoft/Swin-Transformer (PyTorch, if you know of a simple Tensorflow implementation please let me know)

ResNet: Extremely popular, you may have heard of the humble ResNet before. It was originally made a thing when someone wanted to know what would happen if they took the number of layers in a CNN-based model to the extreme. It has a 'residual connection' between blocks of layers, which avoids the 'vanishing gradient problem'. In short, this is when your model is too deep: when it backpropagates the error, the gradient it adjusts the weights by becomes so small that it has hardly any effect.

Since this model was invented, skip connections (or 'residual connections', as they are sometimes known) have become a regular feature in all sorts of models - especially those deeper than just a couple of layers.
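As a very rough back-of-the-envelope illustration of why residual connections help, the sketch below treats each layer as a scalar function with the same small local gradient. That is a big simplification (real layers have matrices of gradients that differ everywhere), and the numbers are made up, but it shows the general idea.

```python
def gradient_through_plain(local_gradient, depth):
    """Chain rule through `depth` plain layers: the local gradients multiply."""
    return local_gradient ** depth


def gradient_through_residual(local_gradient, depth):
    """With y = x + f(x), each block contributes 1 + f'(x) instead of f'(x)."""
    return (1.0 + local_gradient) ** depth


# Hypothetical values: 20 layers, each with a local gradient of 0.1
plain = gradient_through_plain(0.1, 20)
residual = gradient_through_residual(0.1, 20)

print(plain)     # vanishingly small
print(residual)  # stays well away from zero
```

The identity path means the gradient reaching the early layers never has to pass through every layer's small local gradient in sequence.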

Despite this significant advancement though, I recommend using a ConvNeXt encoder instead for images - it's better, and its architecture is more well tuned, than a bog standard ResNet.

Sequenced Data

Transformer: The big one that is quite possibly the most famous encoder-decoder ever invented. The Transformer replaces the LSTM (see below) for handling sequenced data. It has 2 key advantages over an LSTM:

It's more easily parallelisable, which is great for e.g. training on GPUs
The attention mechanism it implements revolutionises the performance of like every network ever

The impact of the Transformer on AI cannot be overstated. In short: If you think you want an LSTM, use a transformer instead.

Paper: Attention is all you need
Explanation: The Illustrated Transformer
Implementation: Many around, but I have yet to find a simple enough one, so I implemented it myself from scratch. While I've tested the encoder and I know it works, I have yet to fully test the decoder. Once I have done this, I will write a separate blog post about it and add the link here.

LSTM: Short for Long Short-Term Memory, LSTMs were invented in 1997 to solve the vanishing gradient problem, in which gradients when backpropagating back through recurrent models that handle sequenced data shrink to vanishingly small values, rendering them ineffective at learning long-term relationships in sequenced data.

Superseded by the (non recurrent) Transformer, LSTMs are a recurrent model architecture, which means that their output feeds back into themselves. While this enables some unique model architectures for learning sequenced data (exhibit A), it also makes them horrible at being parallelised on a GPU (the dilated LSTM architecture attempts to remedy this, but Transformers are just better). The only thing that sets them apart from the Transformer is that the Transformer is somehow not built into Tensorflow as standard yet, whereas LSTMs are.

Just like Vision Transformers adapted the Transformer architecture for multidimensional data, so too do Grid LSTMs adapt normal LSTMs.

Paper: Long Short-Term Memory
Explanation: Understanding LSTMs
Implementation: Tensorflow: tf.keras.layers.LSTM, PyTorch: torch.nn.LSTM

Conclusion

In summary, for images, encoders in priority order are: ConvNeXt, Vision Transformer (Swin Transformer), ResNet, and for text/sequenced data: Transformer, LSTM.

We've looked at a small selection of model architectures for handling a variety of different data types. This is not an exhaustive list, however. You might have another awkward type of data to handle that doesn't fit into either of these categories - e.g. specialised models exist for handling audio, but the general rule of thumb is an encoder architecture probably already exists for your use-case - even if I haven't listed it here.

Also of note are alternative use cases for data types I've covered here. For example, if I'm working with images I would use a ConvNeXt, but if model prediction latency and/or resource consumption mattered I would consider using a MobileNet (https://www.tensorflow.org/api_docs/python/tf/keras/applications/mobilenet_v3), which while a smaller model is designed for producing rapid predictions in lower resource environments - e.g. on mobile phones.

Finally, while these are encoders, decoders also exist for various tasks. Often, they are tightly integrated into the encoder. For example, the U-Net is designed for image segmentation. Listing these is out of scope of this article, but if you are getting into AI to solve a specific problem (as is often the case), I strongly recommend looking to see if an existing model/decoder architecture has been designed to solve your particular problem. It is often much easier to adjust your problem to fit an existing model architecture than it is to design a completely new architecture to fit your particular problem (trust me, I've tried this already and it was a Bad Idea).

The first one you see might not even be the best / state of the art out there - e.g. Transformers are better than the more widely used LSTMs. Surveying the landscape for your particular task (and figuring out how to frame it in the first place) is critical to the success of your model.

Easily write custom Tensorflow/Keras layers

At some point when working on deep learning models with Tensorflow/Keras for Python, you will inevitably encounter a need to use a layer type in your models that doesn't exist in the core Tensorflow/Keras for Python (from here on just simply Tensorflow) library.

I have encountered this need several times, and rather than e.g. subclassing tf.keras.Model, there's a much easier way - and if you just have a simple sequential model, you can even keep using tf.keras.Sequential with custom layers!

Background

First, some brief background on how Tensorflow is put together. The most important thing to remember is Tensorflow likes very much to compile things into native code using what we can think of as an execution graph.

In this case, by execution graph I mean a directed graph that defines the flow of information through a model or some other data processing pipeline. This is best explained with a diagram:

A simple stack of Keras layers illustrated as a directed graph

Here, we define a simple Keras AI model for classifying images which you might define with the functional API. I haven't tested this model - it's just to illustrate an example (use something e.g. like MobileNet if you want a relatively small model for image classification).

The layer stack starts at the top and works its way downwards.

When you call model.compile(), Tensorflow compiles this graph into native code for faster execution. This is important, because when you define a custom layer, you may only use Tensorflow functions to operate on the data, not Python/Numpy/etc ones.

You may have already encountered this limitation if you have defined a Tensorflow function with tf.function(some_function).

The reason for this is the specifics of how Tensorflow compiles your model. Now consider this graph:

A graph of Tensorflow functions

Basic arithmetic operations on tensors as well as more complex operators such as tf.stack, tf.linalg.matmul, etc operate on tensors as you'd expect in a REPL, but in the context of a custom layer or tf.function they operate not on real tensors, but on symbolic ones instead.

It is for this reason that when you implement a tf.function to use with tf.data.Dataset.map() for example, it only gets executed once.
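If symbolic tensors sound a bit abstract, the toy below may help. It is emphatically not how Tensorflow works internally: Symbol is a made-up class that records operations instead of performing them, just to show why the Python function only runs once while the resulting graph can be executed many times.

```python
class Symbol:
    """Toy stand-in for a symbolic tensor: records operations for later."""

    def __init__(self, evaluate, description):
        self.evaluate = evaluate        # function: real input -> real value
        self.description = description  # human-readable form of the graph

    def _wrap(self, other):
        """Turn plain Python numbers into constant nodes in the graph."""
        if isinstance(other, Symbol):
            return other
        return Symbol(lambda _x: other, repr(other))

    def __mul__(self, other):
        other = self._wrap(other)
        return Symbol(lambda x: self.evaluate(x) * other.evaluate(x),
                      f"({self.description} * {other.description})")

    def __add__(self, other):
        other = self._wrap(other)
        return Symbol(lambda x: self.evaluate(x) + other.evaluate(x),
                      f"({self.description} + {other.description})")


trace_count = 0


def scale_and_shift(value):
    global trace_count
    trace_count += 1  # plain Python side effect: only happens while tracing
    return value * 2 + 1


# 'Compile' once by calling the Python function with a symbolic input...
graph = scale_and_shift(Symbol(lambda x: x, "input"))

# ...then run the recorded graph many times without re-running the function.
outputs = [graph.evaluate(x) for x in [1, 2, 3]]

print(graph.description)  # ((input * 2) + 1)
print(outputs)            # [3, 5, 7]
print(trace_count)        # 1
```

The overloaded * and + operators are what let ordinary-looking arithmetic build up the graph.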

Custom layers for the win!

With this in mind, we can relatively easily put together a custom layer. It's perhaps easiest to show a trivial example and then explain it bit by bit.

I recommend declaring your custom layers each in their own file.

import tensorflow as tf

class LayerMultiply(tf.keras.layers.Layer):
    def __init__(self, multiplier=2, **kwargs):
        super(LayerMultiply, self).__init__(**kwargs)

        # Keep the plain Python value so get_config() can return it later
        self.param_multiplier = multiplier
        self.tensor_multiplier = tf.constant(multiplier, dtype=tf.float32)

    def get_config(self):
        config = super(LayerMultiply, self).get_config()

        config.update({
            "multiplier": self.param_multiplier
        })

        return config

    def call(self, input_thing, training=None, **kwargs):
        return input_thing * self.tensor_multiplier

Custom layers are subclassed from tf.keras.layers.Layer. There are a few parts to a custom layer:

The constructor (__init__) works as you'd expect. You can take in custom (hyper)parameters (which should not be tensors) here and use them to control the operation of your custom layer.

get_config() must ultimately return a dictionary of arguments to pass to instantiate a new instance of your layer. This information is saved along with the model when you save a model e.g. with tf.keras.callbacks.ModelCheckpoint in .hdf5 mode, and then used when you load a model with tf.keras.models.load_model (more on loading a model with custom layers later).

A paradigm I usually adopt here is setting self.param_ARG_NAME_HERE fields in the constructor to the value of the parameters I've taken in, and then spitting them back out again in get_config().
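The round trip this enables can be sketched without Tensorflow at all. ToyLayer below is a hypothetical stand-in for a custom layer; the from_config here simply passes the dictionary back to the constructor, which as far as I know is also what the default tf.keras.layers.Layer.from_config does.

```python
import json


class ToyLayer:
    """Hypothetical stand-in for a custom Keras layer (no Tensorflow needed)."""

    def __init__(self, multiplier=2, name="toy_layer"):
        self.param_multiplier = multiplier
        self.name = name

    def get_config(self):
        # Everything needed to re-create this layer, as plain Python values
        return {"name": self.name, "multiplier": self.param_multiplier}

    @classmethod
    def from_config(cls, config):
        return cls(**config)


original = ToyLayer(multiplier=5, name="times_five")

# Saving: the config has to survive being serialised
saved = json.dumps(original.get_config())

# Loading: the config is passed straight back into the constructor
restored = ToyLayer.from_config(json.loads(saved))

print(restored.param_multiplier)  # 5
```

This is also why the values you return from get_config() should be plain Python values rather than tensors.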

call() is where the magic happens. When you call model.compile(), this is called with a symbolic tensor, which stands in for the shape of the real tensor, to build an execution graph as explained above.

The first argument is always the output of the previous layer. If your layer expects multiple inputs, then this will be an array of (potentially symbolic) tensors rather than a (potentially symbolic) tensor directly.

The second argument is whether you are in training mode or not. You might not be in training mode if:

You are spinning over the validation dataset
You are making a prediction / doing inference
Your layer is frozen for some reason

Sometimes you may want to do something differently if you are in training mode vs not training mode (e.g. dataset augmentation), and Tensorflow is smart enough to ensure this is handled as you'd expect.

Note also here that I use a native multiplication with the asterisk * operator. This works because Tensorflow tensors (whether symbolic or otherwise) overload this and other operators so you don't need to call tf.math.multiply, tf.math.divide, etc explicitly yourself, which makes your code neater.

That's it, that's all you need to do to define a custom layer!

Using and saving

You can use a custom layer just like a normal one. For example, using tf.keras.Sequential:

import tensorflow as tf

from .components.LayerMultiply import LayerMultiply

def make_model(batch_size, multiplier):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(96),
        tf.keras.layers.Dense(32),
        LayerMultiply(multiplier=multiplier),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.build(input_shape=tf.TensorShape([ batch_size, 32 ]))
    model.compile(
        optimizer="Adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=[
            tf.keras.metrics.SparseCategoricalAccuracy()
        ]
    )
    return model

The same goes here for the functional API. I like to put my custom layers in a components directory, but you can put them wherever you like. Again here, I haven't tested the model at all, it's just for illustrative purposes.

Saving works as normal, but for loading a saved model that uses a custom layer, you need to provide a dictionary of custom objects:

loaded_model = tf.keras.models.load_model(filepath_checkpoint, custom_objects={
    "LayerMultiply": LayerMultiply,
})

If you have multiple custom layers, define all the ones you use here. It doesn't seem to matter if you define extra ones: it'll just ignore the ones that aren't used.

Going further

This is far from all you can do. In custom layers, you can also:

Instantiate sublayers or models (tf.keras.Model inherits from tf.keras.layers.Layer)
Define custom trainable weights (tf.Variable)

Instantiating sublayers is very easy. Here's another example layer:

import tensorflow as tf

class LayerSimpleBlock(tf.keras.layers.Layer):
    def __init__(self, units, **kwargs):
        super(LayerSimpleBlock, self).__init__(**kwargs)

        self.param_units = units

        self.block = tf.keras.Sequential([
            tf.keras.layers.Dense(self.param_units),
            tf.keras.layers.Activation("gelu"),
            tf.keras.layers.Dense(self.param_units),
            tf.keras.layers.LayerNormalization()
        ])

    def get_config(self):
        config = super(LayerSimpleBlock, self).get_config()

        config.update({
            "units": self.param_units
        })

        return config

    def call(self, input_thing, training, **kwargs):
        return self.block(input_thing, training=training)

This would work with a single sublayer too.

Custom trainable weights are also easy, but require a bit of extra background. If you're reading this post, you have probably heard of gradient descent. The specifics of how it works are out of scope of this blog post, but in short it's the underlying core algorithm deep learning models use to reduce error by stepping bit by bit towards lower error.
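For a feel of what stepping bit by bit towards lower error means, here's a minimal sketch of gradient descent on a single weight in plain Python. The error function (w - 3)^2, the learning rate and the number of steps are all arbitrary choices for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=50):
    """Repeatedly nudge the weight against the direction of the gradient."""
    weight = start
    for _ in range(steps):
        weight -= learning_rate * gradient(weight)
    return weight


# Hypothetical error function: error(w) = (w - 3)^2, so its gradient is 2 * (w - 3)
final_weight = gradient_descent(lambda w: 2.0 * (w - 3.0), start=0.0)

print(final_weight)  # very close to 3, where the error is lowest
```

A deep learning framework does the same thing, except that it works out the gradient for every weight in the model for you.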

Tensorflow goes looking for all the weights in a model during the compilation process (see the explanation on execution graphs above) for you, and this includes custom weights.

You do, however, need to mark a tensor as a weight - otherwise Tensorflow will assume it's a static value. This is done through the use of tf.Variable:

tf.Variable(name="some_unique_name", initial_value=tf.random.uniform([64, 32]))

As far as I've seen so far, tf.Variable()s need to be defined in the constructor of a tf.keras.layers.Layer, for example:

import tensorflow as tf

class LayerSimpleBlock(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(LayerSimpleBlock, self).__init__(**kwargs)

        self.weight = tf.Variable(name="some_unique_name", initial_value=tf.random.uniform([64, 32]))

    def get_config(self):
        config = super(LayerSimpleBlock, self).get_config()
        return config

    def call(self, input_thing, training, **kwargs):
        return input_thing * self.weight

After you define a variable in the constructor, you can use it like a normal tensor - after all, in Tensorflow (and probably other deep learning frameworks too), tensors don't always have to hold an actual value at the time of execution as I explained above (I call tensors that don't contain an actual value like this symbolic tensors, since they are like stand-ins for the actual value that gets passed after the execution graph is compiled).

Conclusion

We've looked at defining custom Tensorflow/Keras layers that you can use without giving up tf.keras.Sequential() or the functional API. I've shown how by compiling Python function calls into native code using an execution graph, many orders of magnitude of performance gains can be obtained, fully saturating GPU usage.

We've also touched on defining custom weights in custom layers, which can be useful depending on what you're implementing. As a side note, should you need a weight in a custom loss function, you'll need to define it in the constructor of a tf.keras.layers.Layer and then pull it out and pass it to your subclass of tf.keras.losses.Loss.

By defining custom Tensorflow/Keras layers, we can implement new cutting-edge deep learning logic that is easy to use. For example, I have implemented a Transformer with a trio of custom layers, and CBAM: Convolutional Block Attention Module also looks very cool - I might implement it soon too.

I haven't posted a huge amount about AI / deep learning on here yet, but if there's any topic (machine learning or otherwise) that you'd like me to cover, I'm happy to consider it - just leave a comment below.

NSD, Part 2: Dynamic DNS

Hey there! In the last post, I showed you how to setup nsd, the Name Server Daemon, an authoritative DNS server to serve records for a given domain. In this post, I'm going to talk through how to extend that configuration to support Dynamic DNS.

Normally, if you query, say, the A or AAAA records for a domain or subdomain like git.starbeamrainbowlabs.com, it will return the same IP address that you manually set in the DNS zone file, or if you use some online service then the value you manually set there. This is fine if your IP address does not change, but becomes problematic if your IP address may change unpredictably.

The solution, as you might have guessed, lies in dynamic DNS. Dynamic DNS is a fancy word for some kind of system where the host system that a DNS record points to (e.g. compute.bobsrockets.com) informs the DNS server about changes to its IP address.

This is done by making a network request from the host system to some kind of API that automatically updates the DNS server - usually over HTTP (though anything else could work too, but please make sure it's encrypted!).
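As a rough idea of what such a request might look like from the client side, the sketch below builds (but does not send) a DynDNS-style update request using only the Python standard library. The /nic/update path, the hostname and myip parameter names, the server address and the credentials are all assumptions for illustration - check the documentation of whatever daemon you use for what it actually expects.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_update_request(server, hostname, ip_address, username, password):
    """Build a DynDNS-style update request (assumed API shape, not sent)."""
    query = urlencode({"hostname": hostname, "myip": ip_address})
    # NOTE: "/nic/update" is an assumed path; plain http is only OK on a trusted network
    url = f"http://{server}/nic/update?{query}"
    credentials = base64.b64encode(f"{username}:{password}".encode()).decode()
    return Request(url, headers={"Authorization": f"Basic {credentials}"})


# Hypothetical server, hostname, IP address and credentials
request = build_update_request(
    server="dyndns.bobsrockets.com:5354",
    hostname="compute.dyn.bobsrockets.com",
    ip_address="203.0.113.7",
    username="computebox",
    password="correct-horse-battery-staple",
)

print(request.full_url)
```

Actually sending it would be a matter of passing the request to urllib.request.urlopen() whenever the host notices that its IP address has changed.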

You may already be familiar with using a HTTP API to inform your cloud-based registrar (e.g. Cloudflare, Gandi, etc) of IP address changes, but in this post we're going to set dynamic DNS up with the nsd server we configured in the previous post mentioned above.

The first order of business is to find some software to do this. You could also write a thing yourself (see also setting up a systemd service). There are several choices, but I went with dyndnsd (I may update this post if I ever write my own daemon for this).

Next, you need to determine what subdomain you'll use for dynamic DNS. Since DNS is hierarchical, an entire subdomain is required: you can't just do dynamic DNS for, say, wiki.bobsrockets.com. Because dyndnsd manages its own DNS zone file, all dynamic DNS hostnames will be under that subdomain - e.g. wiki.dyn.bobsrockets.com.
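
To make the "everything lives under one subdomain" rule concrete, here's a minimal Python sketch. The function name is_in_dynamic_zone is made up for illustration and is not part of dyndnsd; it only shows the label-by-label comparison that the hierarchy implies.

```python
def is_in_dynamic_zone(hostname: str, zone: str = "dyn.bobsrockets.com") -> bool:
    """Return True if hostname sits strictly below the dynamic DNS zone."""
    # Normalise: DNS names are case-insensitive, and we ignore any trailing dot
    host_labels = hostname.lower().rstrip(".").split(".")
    zone_labels = zone.lower().rstrip(".").split(".")
    # The hostname needs at least one extra label in front of the zone's labels
    if len(host_labels) <= len(zone_labels):
        return False
    return host_labels[-len(zone_labels):] == zone_labels


print(is_in_dynamic_zone("wiki.dyn.bobsrockets.com"))  # True
print(is_in_dynamic_zone("wiki.bobsrockets.com"))      # False
```

Comparing whole labels rather than doing a plain string suffix check avoids accidentally accepting something like wikidyn.bobsrockets.com.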

Configuring the server

For the server, I will be assuming that the dynamic DNS daemon will be running on the same server as the nsd daemon.

For this tutorial, we'll be setting it up unencrypted. This is a security risk if you are setting it up to accept requests over the Internet rather than a local trusted network! Notes on how to fix this at the end of this post.

Since this is a Ruby-based program (something I generally recommend avoiding, since I've observed Ruby to be a rather inefficient language to write programs in), first we need to install gem, the Ruby package manager:

sudo apt install ruby ruby-rubygems ruby-dev

Then, we can use gem to install dyndnsd itself:

sudo gem install dyndnsd

Now, we need to configure it. dyndnsd is configured using a YAML (ew) configuration file. It's probably best to show an example configuration file and explain it afterwards:

# listen address and port
host: "0.0.0.0"
port: 5354
# The internal database file. We'll create this in a moment.
db: "/var/lib/dyndnsd/db.json"
# enable debug mode?
debug: false
# all hostnames are required to be cool-name.dyn.bobsrockets.com
domain: "dyn.bobsrockets.com"
# configure the updater, here we use command_with_bind_zone, params are updater-specific
updater:
  name: "command_with_bind_zone"
  params:
    zone_file: "/etc/dyndnsd/zones/dyn.bobsrockets.com.zone"
    command: "systemctl reload nsd"
    ttl: "5m"
    dns: "bobsrockets.com."
    email_addr: "bob.bobsrockets.com"
# Users with the hostnames they are allowed to create/update
users:
  computeuser: # <--- Username
    password: "alongandrandomstring"
    hosts:
      - compute1.dyn.bobsrockets.com
  computeuser2:
    password: "anotherlongandrandomstring"
    hosts:
      - compute2.dyn.bobsrockets.com
      - compute3.dyn.bobsrockets.com

...several things to note here that I haven't already noted in comments.

zone_file: "/etc/dyndnsd/zones/dyn.bobsrockets.com.zone": This is the path to the zone file dyndnsd should update.
dns: "bobsrockets.com.": This is the fully-qualified hostname (with a dot at the end) of the DNS server that will be serving the DNS records (i.e. the nsd server).
email_addr: "bob.bobsrockets.com": This sets the email address of the administrator of the system, but with the @ sign replaced with a dot (.). If your email address contains a dot . in the user part (e.g. bob.rockets@example.com), then it won't work as expected here.
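
As a rough illustration of that email address conversion, here's a small Python sketch. In a hand-written zone file, the convention from RFC 1035 is to escape dots in the user part with a backslash. Whether dyndnsd passes an escaped value through unchanged is an assumption I haven't verified, so treat this purely as a sketch of the conversion. The function name email_to_rname is made up for this example.

```python
def email_to_rname(email: str) -> str:
    """Convert user@domain into the dotted mailbox form used in SOA records."""
    local, _, domain = email.rpartition("@")
    if not local or not domain:
        raise ValueError(f"not an email address: {email!r}")
    # Dots in the user part are escaped, or they'd read as label separators
    return local.replace(".", "\\.") + "." + domain


print(email_to_rname("bob@bobsrockets.com"))      # bob.bobsrockets.com
print(email_to_rname("bob.rockets@example.com"))  # bob\.rockets.example.com
```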

Also important here is that, although it is less confusing to always put a dot . at the end of fully qualified domain names, the example config above doesn't do so consistently: dns has the trailing dot, but domain and the entries under hosts do not.

Once you've written the config file, create the directory /etc/dyndnsd and write it to /etc/dyndnsd/dyndnsd.yaml.

With the config file written, we now need to create and assign permissions to the data directory it will be using. Do that like so:

sudo useradd --no-create-home --system --home /var/lib/dyndnsd dyndnsd
sudo mkdir /var/lib/dyndnsd
sudo chown dyndnsd:dyndnsd /var/lib/dyndnsd

Also, we need to create the directory for the zone file and assign the correct permissions so that dyndnsd can write to it:

sudo mkdir /etc/dyndnsd/zones
sudo chown dyndnsd:dyndnsd /etc/dyndnsd/zones
# symlink the zone file into the nsd zones directory. This way dyndnsd isn't allowed to write to all of /etc/nsd/zones - just the 1 zone file it is supposed to update.
sudo ln -s /etc/dyndnsd/zones/dyn.bobsrockets.com.zone /etc/nsd/zones/dyn.bobsrockets.com.zone

Now, we can write a systemd service file to run dyndnsd for us:

[Unit]
Description=dyndnsd: Dynamic DNS record updater
Documentation=https://github.com/cmur2/dyndnsd

[Service]
User=dyndnsd
Group=dyndnsd
ExecStart=/usr/local/bin/dyndnsd /etc/dyndnsd/dyndnsd.yaml
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=dyndnsd

[Install]
WantedBy=multi-user.target

Save this to /etc/systemd/system/dyndnsd.service. Then, start the daemon like so:

sudo systemctl daemon-reload
sudo systemctl enable --now dyndnsd.service

Finally, don't forget to update your firewall to allow requests through to dyndnsd. For UFW, do this:

sudo ufw allow 5354/tcp comment dyndnsd

That completes the configuration of dyndnsd on the server. Now we just need to update the nsd config file to tell it about the new zone.

nsd's config file should be at /etc/nsd/nsd.conf. Open it for editing, and add the following to the bottom:

zone:
    name: dyn.bobsrockets.com
    zonefile: dyn.bobsrockets.com.zone

...and you're done on the server!

Configuring the client(s)

For the clients, all that needs doing is configuring them to make regular requests to the dyndnsd server to keep it apprised of their IP addresses. This is done by making a HTTP request, so we can test it with curl like this:

curl 'http://computeuser:alongandrandomstring@bobsrockets.com:5354/nic/update?hostname=compute1.dyn.bobsrockets.com'

...where computeuser is the username, alongandrandomstring is the password, and compute1.dyn.bobsrockets.com is the hostname it should update.

The server works out which IP address to set for the subdomain compute1.dyn.bobsrockets.com from the IP address of the client making the request.
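
If you'd rather not shell out to curl, the same request can be sketched in Python using only the standard library. This is a minimal sketch that assumes the same /nic/update endpoint and HTTP Basic authentication as the curl command above; build_update_request is a name made up for this example, and it only builds the request rather than sending it.

```python
import base64
import urllib.parse
import urllib.request


def build_update_request(server, username, password, hostname, port=5354):
    """Build (but don't send) the HTTP request that updates one hostname."""
    query = urllib.parse.urlencode({"hostname": hostname})
    url = f"http://{server}:{port}/nic/update?{query}"
    # HTTP Basic auth: base64 of "username:password"
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


request = build_update_request(
    "bobsrockets.com", "computeuser", "alongandrandomstring",
    "compute1.dyn.bobsrockets.com",
)
print(request.full_url)
# To actually send it: urllib.request.urlopen(request, timeout=10)
```

Keeping the request construction separate from the sending step makes it easy to check the URL before pointing it at a real server.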

The simplest way of automating this is using cron. Add the following cronjob (sudo crontab -e to edit the crontab):

*/5 * * * *     curl -sS 'http://computeuser:alongandrandomstring@bobsrockets.com:5354/nic/update?hostname=compute1.dyn.bobsrockets.com'

...and that's it! It really is that simple. Windows users will need to set up a scheduled task instead and install curl, but that's outside the scope of this post.

Conclusion

In this post, I've given a whistle-stop tour of setting up a simple dynamic DNS server. This can be useful if a host has a dynamic IP address on a local network but it still needs a (sub)domain for some reason.

Note that this is not suitable for untrusted networks! For example, setting dyndnsd to accept requests over the Internet is a Bad Idea, as this simple setup is not encrypted.

If you do want to set this up over an untrusted network, you must encrypt the connection to avoid nasty DNS poisoning attacks. Assuming you already have a working reverse proxy setup on the same machine (e.g. Nginx), you'll need to add a new virtual host (a server { } block in Nginx) that reverse-proxies to your dyndnsd daemon and sets the X-Real-IP HTTP header, and then ensure port 5354 is closed on your firewall to prevent direct access.

This is beyond the scope of this post and slightly different depending on your setup, but if there's the demand I can blog about how to do this.

The NSD Authoritative DNS Server: What, why, and how

In a previous blog post, I explained how to set up unbound, a recursive resolving DNS server. I demonstrated how to set up a simple split-horizon DNS configuration, and forward DNS requests to an upstream DNS server - potentially over DNS-over-TLS.

Recently, for reasons that are rather complicated, I found myself in an awkward situation which required an authoritative DNS server - and given my love of explaining complicated and rather niche concepts here on my blog, I thought this would be a fabulous opportunity to write a 2-part series :P

In this post, I'm going to outline the difference between a recursive resolver and an authoritative DNS server, and explain why you'd want one and how to set one up. I'll explain how it fits as a part of a wider system.

Go grab your snacks - you'll be learning more about DNS than you ever wanted to know....

DNS in a (small) nutshell

As I'm sure you know if you're reading this, DNS stands for the Domain Name System. It translates domain names (e.g. starbeamrainbowlabs.com.) into IP addresses (e.g. 5.196.73.75, or 2001:41d0:e:74b::1). Every network-connected system will make use of a DNS server at one point or another.

DNS functions on records. These define how a given domain name should be resolved to its corresponding IP address (or vice versa, but that's out-of-scope of this post). While there are many different types of DNS record, here's a quick reference for the most common ones you'll encounter when reading this post.

A: As simple as it gets. An A record defines the corresponding IPv4 address for a domain name.
AAAA: Like an A record, but for IPv6.
CNAME: An alias, like a symlink in a filesystem [Linux] or a directory junction [Windows]
NS: Specifies the domain name of the authoritative DNS server that holds DNS records for this domain. See more on this below.

A tale of 2 (DNS) servers

Consider your laptop, desktop, phone, or other device you're reading this on right now. Normally (if you are using DHCP, which is a story for another time), your router (which usually acts as the DHCP server on most home networks) will tell you what DNS server(s) to use.

The servers that your device talks to are what's known as recursive resolving DNS servers. These DNS servers do not have any DNS records themselves: their entire purpose is to ask other DNS servers to resolve queries for them.

At first this seems rather counterintuitive. Why bother when you can have a server that actually hosts the DNS records themselves and just ask that every time instead?

Given the size of the Internet today, this is unfortunately not possible. If we all used the same DNS server that hosted all DNS records, it would be drowned in DNS queries that even the best Internet connection would not be able to handle. It would also be a single point of failure - bringing the entire Internet crashing down every time maintenance was required.

To this end, a more scalable system was developed. By having multiple DNS servers between users and the authoritative DNS servers that actually hold the real DNS records, we can ensure the system scales virtually infinitely.

The next question that probably comes to mind is where the name recursive resolving DNS server comes from. This name comes from the way that these recursive DNS servers ask other DNS servers for the answer to a query, instead of answering based on records they hold locally (though most recursive resolving DNS servers also have a cache for performance, but this is also a tale for another time).

Some recursive resolving DNS servers - such as the one built into your home router - simply ask 1 or 2 upstream DNS servers - usually either provided by your ISP or manually set by you (I recommend 1.1.1.1/1.0.0.1) - but others are truly recursive.

Take peppermint.mooncarrot.space. for example. If we had absolutely no idea where to start resolving this domain, we would first ask a DNS root server for help. Domain names are hierarchical in nature - sub.example.com. is a subdomain of example.com.. The same goes then that mooncarrot.space. is a subdomain of space., which is itself a subdomain of ., the DNS root zone. It is no accident that all the domain names in this blog post have a dot at the end of them (try entering starbeamrainbowlabs.com. into your browser, and watch as your browser auto-hides the trailing dot .).

In this way, if we know the IP address of a DNS root server (e.g. 193.0.14.129, or 2001:7fd::1), we can recurse through this hierarchical tree to discover the IP address associated with a domain name we want to resolve.

First, we'd ask a root server to tell us the authoritative DNS server for the space. domain name. We do this by asking it for the NS record for the space. domain.

Once we know the address of the authoritative DNS server for space., we can ask it for the NS record for the mooncarrot.space. domain. We may repeat this process a number of times - I'll omit the specific details of this for brevity (if anyone's interested, I can write a full deep dive post into this, how it works, and how it's kept secure - comment below) - and then we can finally ask the authoritative DNS server we've tracked down to resolve the domain name peppermint.mooncarrot.space. to an IP address for us (e.g. by asking for the associated A or AAAA record).
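
To visualise the order in which a truly recursive resolver works its way down the tree, here's a minimal Python sketch that lists each name from the root zone downwards. It makes no network requests, and it simplifies things by treating every label as a potential zone, which isn't always the case in practice. The function name zones_to_walk is made up for illustration.

```python
def zones_to_walk(fqdn: str) -> list[str]:
    """List each name from the root down to the full name, e.g. '.' then 'space.'."""
    labels = [label for label in fqdn.rstrip(".").split(".") if label]
    zones = ["."]
    # Build names right-to-left: space. -> mooncarrot.space. -> ...
    for i in range(len(labels) - 1, -1, -1):
        zones.append(".".join(labels[i:]) + ".")
    return zones


for zone in zones_to_walk("peppermint.mooncarrot.space."):
    print(zone)
```

At each step except the last, the resolver would ask the server it currently knows about for the NS record of the next name in the list.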

Authoritative DNS servers

With this in mind, we can now move on to the main purpose of this post: setting up an authoritative DNS server. As you might have guessed by now, the purpose of an authoritative DNS server is to hold records about 1 or more domain names.

While most of the time the authoritative DNS server for your domain name will be either your registrar or someone like Cloudflare, there are a number of circumstances in which it can be useful to run your own authoritative DNS server(s) and not rely on your registrar:

If you need more control over the DNS records served for your domain than your registrar provides
Serving complex DNS records for a domain name on an internal network (split-horizon DNS)
Setting up your own dynamic DNS system (i.e. where you dynamically update the IP address(es) that a domain name resolves to via an API call)

Other situations certainly exist, but these are 3 that come to mind at the moment (comment below if you have any other uses for authoritative DNS servers).

The specific situation I found myself in was a combination of the latter 2 points here, so that's the context in which I'll be talking.

To set one up, we first need some software to do this. There are a number of DNS servers out there:

Bind9 [recursive; authoritative]
Unbound [recursive; not really authoritative; my favourite]
Dnsmasq [recursive]
systemd-resolved [recursive; it always breaks for me so I don't use it]

As mentioned, Unbound is my favourite, so for this post I'll be showing you how to use its equally cool sibling, nsd (Name Server Daemon).

The Name Server Daemon

Now that I've explained what an authoritative DNS server is and why it's important, I'll show you how to install and configure one, and then convince another recursive resolving DNS server that's under your control to ask your new authoritative DNS server instead of its default upstream to resolve DNS queries for a given domain name.

It goes without saying that I'll be using Linux here. If you haven't already, I strongly recommend using Linux for hosting a DNS server (or any other kind of server). You'll have a bad day if you don't.

I will also be assuming that you have a level of familiarity with the Linux terminal. If you don't, learn your terminal and then come back here.

nsd is available in all major distributions of Linux in the default repositories. Adjust as appropriate for your distribution:

sudo apt install nsd

nsd has 2 configuration files that are important. First is /etc/nsd/nsd.conf, which configures the nsd daemon itself. Let's do this one first. If there's an existing config file here, move it aside and then paste in something like this:

server:
    port: 5353

    server-count: 1
    username: nsd

    logfile: "/var/log/nsd.log"
    pidfile: "/run/nsd.pid"

    # The zonefile directive(s) below is prefixed by this path
    zonesdir: /etc/nsd/zones

zone:
    name: example.com
    zonefile: example.com.zone

...replace example.com with the domain name that you want the authoritative DNS server to serve DNS records for. You can also have multiple zone: blocks for different (sub)domains - even if those domain names are subdomains of others.

For example, I could have a zone: block for both example.com and dyn.example.com. This can be useful if you want to run your own dynamic DNS server, which will write out a full DNS zone file (a file that contains DNS records) without regard to any other DNS records that might have been in that DNS zone.

Replace also 5353 with the port you want nsd to listen on. In my case I have my authoritative DNS server running on the same box as the regular recursive resolver, so I've had to move the authoritative DNS server aside to a different port as dnsmasq (the recursive DNS server I have running on this particular box) has already taken port 53.

Next up, create the directory /etc/nsd/zones, and then open up example.com.zone for editing inside that new directory. In here, we will put the actual DNS records we want nsd to serve.

The format of this file is governed by RFC1035 section 5 and RFC1034 section 3.6.1, but the nsd docs provide a simpler example. See also the Wikipedia page on DNS zone files.

Here's an example:

; example.com.
$TTL 300
example.com. IN     SOA    a.root-servers.net. admin.example.com. (
                2022090501  ; Serial
                3H          ; refresh after 3 hours
                1H          ; retry after 1 hour
                1W          ; expire after 1 week
                1D)         ; minimum TTL of 1 day

; Name Server
@                   IN NS       dns.example.com.

@                   IN A        5.196.73.75
example.com.        IN AAAA     2001:41d0:e:74b::1
www                 IN CNAME    @
ci                  IN CNAME    @

Some notes about the format to help you understand it:

Make sure ALL your fully-qualified domain names have the trailing dot at the end otherwise you'll have a bad day.
$TTL 300 specifies the default TTL (Time To Live, or the time DNS records can be cached for) in seconds for all subsequent DNS records.
Replace example.com. with your domain name.
admin.example.com. should be the email address of the person responsible for the DNS zone file, with the @ replaced with a dot instead.
dns.example.com. in the NS record must be set to the domain name of the authoritative DNS server serving the zone file.
@ IN A 5.196.73.75 is the format for defining an A record (see the introduction to this blog post) for example.com. - @ is automatically replaced with the domain name in question - in this case example.com.
When declaring a record, if you don't add the trailing dot then it is assumed you're referring to a subdomain of the domain this DNS zone file is for - e.g. if you put www it assumes you mean www.example.com.
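
The trailing dot and @ rules above can be summed up in a few lines of Python. This is only a sketch of the naming rules as described in these notes (the function name qualify is made up, and it ignores other zone file features like $ORIGIN), but it shows why forgetting the trailing dot gives you a bad day:

```python
def qualify(name: str, origin: str = "example.com.") -> str:
    """Expand a name from a zone file into a fully-qualified domain name."""
    if name == "@":
        return origin              # @ is shorthand for the zone's own name
    if name.endswith("."):
        return name                # Trailing dot: already fully qualified
    return f"{name}.{origin}"      # Otherwise it's relative to the zone


print(qualify("@"))                # example.com.
print(qualify("www"))              # www.example.com.
print(qualify("example.com"))      # example.com.example.com. (oops!)
```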

Once you're done, all that's left for configuring nsd is to start it up for the first time (and on boot). Do that like so:

sudo systemctl restart nsd
sudo systemctl enable nsd

Now, you should be able to query it to test it. I like to use dig for this:

dig -p 5353 +short @dns.example.com example.com

...this should return a result based on the DNS zone file you defined above. Replace 5353 with the port number your authoritative DNS server is running on, or omit -p 5353 altogether if it's running on port 53.

Try it out by updating your DNS zone file and reloading nsd: sudo systemctl reload nsd

Congratulations! You now have an authoritative DNS server under your control! This does not mean that it will be queried by any other DNS servers on your network though - read on.....

Integration with the rest of your network

The final part of this post will cover integrating an authoritative DNS server with another DNS server on your network - usually a recursive one. How you do this will vary depending on the target DNS server you want to convince to talk to your authoritative DNS server.

For Unbound:

I've actually covered this in a previous blog post. Simply update /etc/unbound/unbound.conf with a new block like this:

forward-zone:
    name: "example.com."
    forward-addr: 127.0.0.1@5353

...where example.com. is the domain name to forward for (WITH THE TRAILING DOT; and all subdomains thereof), 127.0.0.1 is the IP address of the authoritative DNS server, and 5353 is the port number of the authoritative DNS server.
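
As a mental model of what that forward-zone block does, here's a simplified Python sketch of how a forwarder might decide where to send a query: the most specific matching zone wins, and anything else goes to the default upstream. This is not Unbound's actual implementation, and both the pick_upstream name and the 1.1.1.1@53 default are placeholders for illustration.

```python
def pick_upstream(qname, forward_zones, default="1.1.1.1@53"):
    """Choose where to send a query: the most specific matching zone wins."""
    labels = qname.lower().rstrip(".").split(".")
    # Try the full name first, then strip labels off the front one at a time
    for i in range(len(labels)):
        candidate = ".".join(labels[i:]) + "."
        if candidate in forward_zones:
            return forward_zones[candidate]
    return default


forward_zones = {"example.com.": "127.0.0.1@5353"}
print(pick_upstream("www.example.com.", forward_zones))  # 127.0.0.1@5353
print(pick_upstream("notexample.com.", forward_zones))   # 1.1.1.1@53
```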

Then, restart Unbound like so:

sudo systemctl restart unbound

For dnsmasq:

Dnsmasq's main config file is located at /etc/dnsmasq.conf, but there may be other config files located in /etc/dnsmasq.d/ that might interfere. Either way, update dnsmasq's config file with this directive:

server=/example.com./127.0.0.1#5353

If there's another server=/example.com./... directive elsewhere in your dnsmasq config, it may override your new definition.

Then, restart dnsmasq like so:

sudo systemctl restart dnsmasq

If there's another DNS server that I haven't included here that you use, please leave a comment on how to reconfigure it to forward a specific domain name to a different DNS server.

Conclusion

In this post, I've talked about the difference between an authoritative DNS server and a recursive resolving DNS server. I've shown why authoritative DNS servers are useful, and alluded to reasons why running your own authoritative DNS server can be beneficial.

In the second post in this 2-part miniseries, I'm going to go into detail on dynamic DNS, why it's useful, and how to set up a dynamic DNS server.

As always, this blog post is a starting point - not an ending point. DNS is a surprisingly deep subject: from DNS root hint files to mDNS (multicast DNS) to the various different DNS record types, there are many interesting and useful things to learn about it.

After all, it's always DNS..... especially when you don't think it is.

Sources and further reading

Learn your terminal
Cluster, Part 3: Laying groundwork with Unbound as a DNS server
Root hints - a collection of operational and configuration FAQs
DNS Root Servers - IANA
What is a DNS record? - Cloudflare
DNS cache poisoning - Cloudflare
Multicast DNS
DNS servers:
- Bind9 [complicated to configure]
- Unbound [my favourite; also known for being secure I think]
- nsd [my favourite; sibling of Unbound - they work very well together]
- dnsmasq [haven't had much experience with it, but great when it works]
- systemd-resolved [recursive; it always breaks for me so I don't use it]
DNS Zone files:
- Wikipedia
- nsd docs
- Official spec: RFC1035 section 5 and RFC1034 section 3.6.1

Mounting LVM partitions from the terminal on Linux

Hello there! Recently I found myself with the interesting task of mounting an LVM partition by hand. It wasn't completely straightforward and there was a bunch of guesswork involved, so I thought I'd document the process here.

For those who aren't aware, LVM stands for the Logical Volume Manager, and it's present on Linux systems to make managing partitions easier. It can:

Move and resize partitions while they are still mounted
Span multiple disks

....but to my knowledge it doesn't have any redundancy (use Btrfs) or encryption (use LUKS) built in. It is commonly used to manage the partitions on your Linux desktop, as then you don't need to reboot it into a live Linux environment to fiddle with your partitions as much.

LVM works on a layered system. There are 3 layers to it:

Physical Volumes: Normal physical partitions on the disk.
Volume Groups: Groups of logical (LVM) partitions.
Logical Volumes: LVM-managed partitions.

In summary, logical volumes are part of a volume group, which spans 1 or more physical volumes.

With this in mind, first list the available volume groups, and identify which is the one you want to mount:

sudo vgdisplay

Notice the VG Size in the output. Comparing it with the output of lsblk -o NAME,RO,SIZE,RM,TYPE,MOUNTPOINT,LABEL,VENDOR,MODEL can be helpful to identify which one is which.

I encountered a situation where I had 2 volume groups with the same name - one from my host system I was working on, and another from the target disk I was trying to mount. In my situation each disk had its own volume group assigned to it, so I needed to rename one of the volume groups.

To do this, take the value of the VG UUID field of the volume group you want to rename from the output of sudo vgdisplay above, and then rename it like this:

sudo vgrename SOME_ID NEW_NAME

...for example, I did this:

sudo vgrename 5o1LoG-jFdv-v1Xm-m0Ca-vYmt-D5Wf-9AAFLm examplename

With that done, we can now locate the logical volume we want to mount. Do this by listing the logical volumes in the volume group you're interested in:

sudo lvdisplay vg_name

Note down the name of the logical volume you want to mount. Now we just need to figure out where it is actually located in /dev so that we can mount it. Despite the LV Path field appearing to show us this, it's not actually correct - at least on my system.

Instead, list the contents of /dev/mapper:

ls /dev/mapper

You should see the name of the logical volume that you want to mount in the form volumegroup-logicalvolumename. Once found, you should be able to mount it like so:

sudo mount /dev/mapper/volumegroup-logicalvolumename path/to/directory

...replacing path/to/directory with the path to the (empty) directory you want to mount it to.
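
If you're scripting this, the /dev/mapper name can be worked out from the volume group and logical volume names. Here's a minimal Python sketch (mapper_path is a made-up name, as are the example volume names). One thing to watch for: because device-mapper joins the 2 names with a single hyphen, any hyphens already inside a name are doubled.

```python
def mapper_path(volume_group: str, logical_volume: str) -> str:
    """Guess the /dev/mapper path for a logical volume."""
    # Device-mapper joins the two names with a single hyphen, so any hyphens
    # already inside a name are doubled to keep the result unambiguous
    vg = volume_group.replace("-", "--")
    lv = logical_volume.replace("-", "--")
    return f"/dev/mapper/{vg}-{lv}"


print(mapper_path("examplename", "root"))  # /dev/mapper/examplename-root
print(mapper_path("my-vg", "home"))        # /dev/mapper/my--vg-home
```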

If you can't find it, then it is probably because you plugged the drive in question in after you booted up. In this case, it's probable that the volume group is not active. You can check whether this is the case like so:

sudo lvscan

If it isn't active, then you can activate it like this:

sudo lvchange -a y vg_name

...replacing vg_name with the name of the volume group you want to activate. Once done, you can then mount the logical volume as I mentioned above.

Once you are done, unmounting it is a case of reversing these steps. First, unmount the partition:

sudo umount path/to/mount_point

Then, disable the volume group again:

sudo lvchange -a n vg_name

Finally, flush any cached writes to disk, just in case:

sync

Now, you can unplug the device from your machine.

That wraps up this quick tutorial. If you spot any mistakes in this, please do leave a comment below and I'll correct it.

PhD Update 14: An old enemy

Hello again! This post is rather late due to one thing and another, but I've finally gotten around to writing it. In the last post, I talked about the CLIP model I trained to predict sentiment using both tweets and their associated images in pairs, and the augmentation system I devised to increase the size of the dataset. I also talked about the plan for a next-generation rainfall radar model, and a journal article I'm writing.

Before we begin though, let's start with the customary list of previous posts:

Since that last post, I've pretty much finished my initial draft of the journal article - though it is rather overlength, and I've also made a significant start on the rainfall radar model, which is what I will be focusing on in this blog post as there isn't all that much to talk about with the journal article at the moment (I'm unsure how much I'm allowed to share). I will make a separate post when I (finally) publish the journal article.

Rainfall radar model, revisited

As you might remember, I have dealt with rainfall radar data before (exhibit A, B, C, D), and it didn't go too well. After the part of my PhD on social media, I have learnt a lot about AI models and how to build them. I have also learnt a lot about data preprocessing. With all this in hand, I am now better equipped to do battle once more with an old enemy: the 1.5M time step rainfall radar dataset.

For those who are somewhat confused, the dataset in question is in 2 dimensions (i.e. like greyscale images). It is comprised of 3 things:

A heightmap
Rainfall radar data every 5 minutes
Water depth information, calculated by HAIL-CAESAR and binarised to water / no water for each pixel with a simple threshold
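
For the avoidance of doubt about what "binarised with a simple threshold" means, here's a tiny pure-Python sketch. The 0.1 metre default and the example grid are illustrative assumptions, and a real pipeline would do this on tensors rather than nested lists.

```python
def binarise(depths, threshold=0.1):
    """Map a 2D grid of water depths (metres) to 1 (water) / 0 (no water)."""
    return [[1 if depth > threshold else 0 for depth in row] for row in depths]


grid = [
    [0.00, 0.05, 0.30],
    [0.12, 0.00, 1.50],
]
print(binarise(grid))  # [[0, 0, 1], [1, 0, 1]]
```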

Given that the rainfall radar dataset has an extremely restrictive licence, I am unfortunately unable to share sample images from the dataset here.

My first objective was to tame the beast. To do this, I needed to convert the data to .tfrecord.gz files (applying all the preprocessing transformations ahead of time) instead of the split .asc.stream.gz and .jsonl.gz files I was using. At first, I thought I could use a TextLineDataset (it even supports reading from gzipped files!), but the snag here is that Tensorflow does not have a JSON parsing function.

The reason this is a problem is due to the new way I am parsing my dataset. Before, I used tf.data.Dataset.from_generator() and a regular Python function, but I have since discovered that there is a much more efficient way of doing things. The key revelation here was that Tensorflow does not just simply execute e.g. your custom layers you implement and call .call() each time. No, instead it calls it once and constructs a graph of operations, before then compiling this into machine code that the GPU can understand. The implication of this is twofold:

It is significantly more efficient to take advantage of Tensorflow's execution graph functionality where available
Once your (any part of) dataset becomes a Tensor, it must stay a Tensor

This not only goes for custom layers, loss functions, etc, but it also goes for the dataset pipeline too! I strongly recommend using the .map() function on tf.data.Dataset with a tf.function. Avoid .from_generator() if you can possibly help it!

To take advantage of this, I needed to convert my dataset to a set of .tfrecord.gz files (to support parallel reading, esp. since Viper has a high read latency). Given my code to parse my dataset is in Javascript/Node.js, I first tried using the tfrecord npm package to write .tfrecord files in Javascript directly. This did not work out though, as it kept crashing. I also tried variant packages like tfrecords and tfrecord-stream and more, but none of them worked. In the end, I settled on a multi-step process:

Convert split data into .jsonl.gz files, 4K records per file. Do all preprocessing / correction steps here.
Make all records unique: hash all records in all files, mark records for deletion, then delete them from files
Recompress .jsonl.gz files to 4K records per file
Convert .jsonl.gz → .tfrecord.gz with Python child processes managed by Node.js
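
As a rough illustration of the deduplicate-then-rechunk steps in the middle of that list, here's a Python sketch. It is not the actual pipeline (which is Node.js-based and works on .jsonl.gz files on disk): it works on in-memory records, and it assumes "4K" means 4000 records per file. The function names are made up for this example.

```python
import hashlib
import json


def unique_records(records):
    """Yield each record once, comparing records by a hash of their content."""
    seen = set()
    for record in records:
        # sort_keys makes the hash independent of key order
        digest = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield record


def chunked(records, size=4000):
    """Group records into lists of at most `size` items, one list per output file."""
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


records = [{"id": 1}, {"id": 2}, {"id": 1}, {"id": 3}]
print(list(chunked(unique_records(records), size=2)))
# [[{'id': 1}, {'id': 2}], [{'id': 3}]]
```

Storing only the hashes rather than the records themselves keeps memory usage down, which matters when the dataset is far too big to hold in memory at once.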

Overcomplicated? Perhaps. Do I have a single command I can execute to do all of this? Nope! Does it work? Absolutely :P

With the data converted I turned my attention to the model itself. As I have discussed previously, my current hypothesis is that the previous models failed because the relationship between the rainfall radar and water depth data is non-obvious (and that the model designs were terrible. 5K parameters? hahahaha, 5M parameters is probably the absolute minimum I would need). To this end, I will be first training a contrastive learning model to find relationships between the dataset items. Only then will I train a model to predict water depth, which I'll model as an image segmentation task (I have yet to find a segmentation decoder to implement, so suggestions here are welcome).

The first step here is to implement the contrastive learning algorithm. This is non-trivial however, so I implemented a test model using images from Reddit (r/cats, r/fish, and r/dogs) to test it and test the visualisations that I will require to determine the effectiveness of the model. In doing this, I found that the algorithm for contrastive learning in the CLIP paper (Learning Transferable Visual Models From Natural Language Supervision) was wrong and completely different to that which is described in the code, and I couldn't find the training loop or core loss function at all - so I had to piece together something from a variety of different sources.

To visualise the model, I needed a new approach. While the loss function value over time plotted on a graph is useful, it's difficult to tell if the resulting embedded representation the model outputs is actually doing what it is supposed to. From reading online, there are 2 ways of visualising embedding representations I've found:

Dimensionality reduction
Parallel coordinates plot

I can even include here a cool plot that demonstrates both of them with the pretrained CLIP model I used in the social media half of my project:

The second one is the easier to explain so I'll start with that. If you imagine that the output of the model is of shape [ batch_size, embedding_dim ] / [ 64, 200 ], then for every record in the dataset we can plot a line across a set of vertical lines, where each vertical line stands for each successive dimension of the embedding. This is what I have done in the plot on the right there.

The plot on the left uses the UMAP dimensionality reduction algorithm (paper), which to my knowledge is the best dimensionality reduction algorithm out there at the moment. For the uninitiated, a dimensionality reduction algorithm takes a vector with many dimensions - such as one with an embedding dimension of size 200 - and converts it into a lower-dimensional value (most commonly in 2 or 3 dimensions) so that it can be plotted and visualised. This is particularly helpful in AI when you want to check if your model is actually doing what you expect.

I took some time to look into this, as there are a number of other algorithms out there and it seems like it's far too easy to pick the wrong one for the task. In short, there are 3 different algorithms you'll see most often:

PCA: Stands for Principal Component Analysis, and while popular it only supports linear transformations, whereas most AI models are non-linear.
tSNE: A non-linear alternative (designed for AI applications, in part) that is also rather popular. It does not preserve the global structure of the dataset (i.e. relationships and distances between different values) very well though.
UMAP: Stands for Uniform Manifold Approximation and Projection. It is designed as an alternative to tSNE and preserves global structure much better.

Sources for this are at the end of this post. If you're applying PCA or tSNE for dimensionality reduction in an AI context, consider switching it out to UMAP.

In the plot above, it is obvious that the pretrained CLIP model can differentiate between the 3 types of pet that I gave it as a test dataset. The next step was to train a model with the contrastive learning and the test dataset.

To do this, I needed an encoder. In the test, I used ResNetV2, which is apparently an improved version of the ResNet architecture (I have yet to read the paper on it). Since I implemented it though, I have discovered an implementation of the state-of-the-art image encoder ConvNeXt (paper), so I'm using that in the main model. See my recent post on my image captioning project for more details on image encoders, but in short to the best of my knowledge ConvNeXt is the current state of the art.

Anyway, when I plotted the output of this model it gave me this plot:

I notice a few issues with this. Firstly and most obviously, the points are all jumbled up! It has not learnt the difference between cats, fish, and dogs. I suspect this is because the input to the test model I trained got 2 variants of the same image altered randomly in different ways (flipping, hue change, etc) rather than an image and a textual label. I'm not too worried though, 'cause the real model will have 2 different items as inputs - I was avoiding doing extra work here.

Secondly, the parallel coordinates plot does not show a whole lot of variance between the different items. This is more worrying, but I'm again hoping that this issue will fix itself when I give the model 'real pairs' of rainfall radar <-> water depth images (with the heightmap thrown in there somewhere probably, I haven't decided yet).

Finally, I plotted a UMAP graph with completely random points to ensure it represented them properly:

As you can see, it plots them in a roughly spherical shape with no clear form or separation between the points. I'm glad I did this, because at first I was passing the labels to the UMAP plotter in the wrong way, and it instead artificially moved the points into groups.

With the test model done, I have moved swiftly on to (pretraining) the actual model itself. This is currently underway so I don't have anything to show just yet (it is still training and I have yet to implement code to plot the output), but I can say that thanks to my realisations in Tensorflow graph execution as tensors, I'm seeing a GPU utilisation of 95% and above at all times :D

Conclusion

I've got a journal article written, but it's overlength so my job there isn't quite done just yet. When it is published, I will definitely make a dedicated post here!

Now, I have moved from writing to implementing a new model to tackle the rainfall radar part of my project. By using contrastive learning, I hope to enable the model to learn the relationship between the rainfall radar data and the water depth information. Once I've trained a contrastive learning model, I'll attach and train another model for image segmentation to predict the water depth information.

If you know of any state-of-the-art image segmentation decoder AI architectures, please leave a comment below. Bonus points if I can configure it to have >= 5M parameters without running out of memory. I'm currently very unsure what I'm going to choose.

Additionally, if you have any suggestions for additional tests I can do to verify my contrastive learning model is actually learning something, please leave a comment below also. The difficulty is that while the loss value goes down, it's extremely difficult to tell whether what it's learning is actually sensible or not.

The plan to caption and index images

Something that has been on my mind for a while are the photos that I take. At last count on my NAS I have 8564 pictures I have taken so far since I first got a phone to take them with, and many more belonging to other family members.

I have blogged before about a script I've written that automatically processes photographs and files them by year and month. It fixes the date taken, sets the thumbnail for rapid preview loading, automatically rotates them to be the right way up, losslessly optimises them, and more.

The one thing it can't do though is to help me locate a specific photo I'm after, so given my work with AI recently I have come up with a plan to do something about this, and I want to blog about it here.

By captioning the images with an AI, I plan to index the captions (and other image metadata) and have a web interface in the form of a search engine. In this blog post, I'm going to outline the AI I intend to use, and the architecture of the image search engine I have already made a start on implementing.

AI for image captioning

The core AI to do image captioning will be somewhat based on work I've done for my PhD. The first order of business was finding a dataset to train on, and I stumbled across Microsoft's Common Objects in Context dataset. The next and more interesting part was to devise a model architecture that translates an image into text.

When translating 1 thing (or state space) into another in AI, it is generally done with an encoder-decoder architecture. In my case here, that's an encoder for the image - to translate it into an embedded feature space - and a decoder to turn that embedded feature space into text.

There are many options for these - especially for encoding images - which I'll look at first. While doing my PhD, I've come across many different encoders for images, which I'd roughly categorise into 2 main categories:

Convolutional Neural Network (CNN) based models
Transformer based models, e.g. a Swin Transformer

Since the transformer model was invented, transformers have been widely considered to be the best option. Swin Transformers adapt this groundbreaking design for images - transformers originally handled text - from what I can tell better than the earlier Vision Transformer architecture.

On the other side, a number of encoders were invented before transformers were a thing - the most famous of which was ResNet (I think I have the right paper), which was basically just a bunch of CNN layers stacked on top of one another with a few extra bits like normalisation and skip connections.

Recently though, a new CNN-based architecture has appeared that draws inspiration from the strong points of transformers - it's called ConvNeXt. Based on the numbers in the paper, it even beats the Swin Transformer model mentioned earlier. Best of all, it's much simpler in design, which makes it relatively easy to implement. It is this model architecture I will be using.

For the text, things are more straightforward - the model architecture I'll be using is a transformer (of course - I even implemented it myself from scratch!) - but the trouble is representation. Particularly the representation of the image caption we want the model to predict.

There are many approaches to this problem, but the one I'm going to try first is a word-based solution using one-hot encoding. There are about 27K different unique words in the dataset, so I've assigned each one a unique number in a dictionary file. Then, I can turn this:

[ "a", "cat", "sat", "on", "a", "mat" ]

....into this:

[ 0, 1, 2, 3, 0, 4 ]

...then, the model would predict something like this:

[
    [ 1, 0, 0, 0, 0, 0, ... ],
    [ 0, 1, 0, 0, 0, 0, ... ],
    [ 0, 0, 1, 0, 0, 0, ... ],
    [ 0, 0, 0, 1, 0, 0, ... ],
    [ 1, 0, 0, 0, 0, 0, ... ],
    [ 0, 0, 0, 0, 1, 0, ... ]
]

...where each sub-array is a word.
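
The word-to-number and number-to-one-hot steps above can be sketched in a few lines of Python. This is a toy version that builds the dictionary from a single caption (the real dictionary has about 27K entries), and the function names build_vocabulary and one_hot are made up for illustration.

```python
def build_vocabulary(words):
    """Assign each unique word an index in order of first appearance."""
    vocabulary = {}
    for word in words:
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)
    return vocabulary

def one_hot(indices, vocabulary_size):
    """Turn a list of word indices into a list of one-hot rows."""
    return [
        [1 if position == index else 0 for position in range(vocabulary_size)]
        for index in indices
    ]

caption = ["a", "cat", "sat", "on", "a", "mat"]
vocabulary = build_vocabulary(caption)
indices = [vocabulary[word] for word in caption]
print(indices)  # [0, 1, 2, 3, 0, 4]
for row in one_hot(indices, len(vocabulary)):
    print(row)
```

Each row is as long as the dictionary, which is where the memory cost comes from.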

This will as you might suspect use a lot of memory - especially with 27K words in the dictionary. By my calculations, with a batch size of 64 and a maximum caption length of 25, each output prediction tensor will use a whopping 172.8 MB (about 164.8 MiB) of memory as float32, or 86.4 MB (about 82.4 MiB) as float16 (more on memory usage later).
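
The arithmetic behind those figures can be checked with a few lines of Python. This sketch assumes the dictionary holds exactly 27,000 words (the text only says "about 27K"), and it prints each total in both decimal megabytes (MB) and binary mebibytes (MiB), since the two differ by roughly 5%.

```python
def tensor_bytes(batch_size, caption_length, vocabulary_size, bytes_per_value):
    """Size in bytes of one [batch, caption length, vocabulary] output tensor."""
    return batch_size * caption_length * vocabulary_size * bytes_per_value

for name, bytes_per_value in [("float32", 4), ("float16", 2)]:
    total = tensor_bytes(64, 25, 27_000, bytes_per_value)
    megabytes = total / 1000**2  # MB (decimal)
    mebibytes = total / 1024**2  # MiB (binary)
    print(f"{name}: {megabytes:.1f} MB / {mebibytes:.1f} MiB")
```

This prints 172.8 MB / 164.8 MiB for float32 and 86.4 MB / 82.4 MiB for float16.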

I'm considering a variety of techniques to combat this if it becomes an issue. For example, reducing the dictionary size by discarding infrequently used words.

Another option would be to have the model predict GloVe vectors as an output and then compare the output to the GloVe dictionary to tell which one to pick. This would come with its own set of problems however, like lots of calculations to compare each word to every word in the dictionary.
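
To make the cost of that comparison concrete, here's a minimal sketch of a nearest-word lookup using cosine similarity. The 3-dimensional vectors are made up and are not real GloVe values, and choosing cosine similarity as the distance measure is an assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_word(predicted, dictionary):
    """Return the dictionary word whose vector is most similar to predicted."""
    return max(
        dictionary,
        key=lambda word: cosine_similarity(predicted, dictionary[word]),
    )

# Made-up toy vectors, NOT real GloVe values
toy_dictionary = {
    "cat": [0.9, 0.1, 0.0],
    "mat": [0.1, 0.9, 0.0],
    "sat": [0.0, 0.2, 0.9],
}
print(nearest_word([0.8, 0.2, 0.1], toy_dictionary))  # cat
```

Every predicted word has to be compared against every dictionary entry, so the work grows with caption length multiplied by dictionary size.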

My final thought was that I could maybe predict individual characters instead of full words. There would be more items in the sequence predicted, but each character would only have up to 255 choices (probably more like 36-ish), potentially saving memory.

I have already implemented this AI - I just need to debug and train it now. To summarise, here's a diagram:

The last problem with the AI though is memory usage. I plan on eventually running the AI on a Raspberry Pi, so much tuning will be required to reduce memory usage and latency as much as I can. In particular, I'll be trying out quantising my model and writing the persistent daemon to use Tensorflow Lite to reduce memory usage. Models train using the float32 data type, which uses 32 bits per value, but quantising a model after training to use float16 (16 bits / value) or even uint8 (8 bits / value) would significantly reduce memory usage.
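
To give a feel for what quantising to 8 bits means, here's a minimal sketch that maps a list of floats onto the integers 0 to 255 and back again. It is a generic linear scheme for illustration only: it is not the Tensorflow Lite API, its exact scheme may differ in detail, and the function names and sample weights are made up.

```python
def quantise_uint8(values):
    """Map floats onto the integers 0..255 with a linear scale and offset."""
    low, high = min(values), max(values)
    scale = (high - low) / 255 or 1.0  # avoid dividing by zero for constant input
    quantised = [round((v - low) / scale) for v in values]
    return quantised, scale, low

def dequantise(quantised, scale, low):
    """Recover approximate floats from the 8-bit integers."""
    return [q * scale + low for q in quantised]

weights = [-1.0, -0.25, 0.0, 0.5, 1.0]  # hypothetical model weights
quantised, scale, low = quantise_uint8(weights)
restored = dequantise(quantised, scale, low)
print(quantised)
print(max(abs(a - b) for a, b in zip(weights, restored)))  # worst-case error
```

Each value now needs 1 byte instead of 4, at the cost of a rounding error of at most half the scale.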

Search engine and indexing

The second part of this is the search engine. The idea here is to index all my photos ahead of time, and then have a web interface I can use to search and filter them. The architecture I plan on using to achieve this is rather complicated, and best explained with a diagram:

The backend I have chosen for the index is called meilisearch. It's written in Rust, and provides advanced as-you-type search functionality. I chose it for 2 reasons:

While I'd love to implement my own, meilisearch is an open source project where they have put in more hours into making it cool than I ever would be able to
Being a separate daemon means I can schedule it on my cluster as a separate task, which potentially might end up on a different machine

With this in mind, the search engine has 2 key parts to it: the crawler / indexer, and the HTTP server that serves the web interface. The web interface will talk to meilisearch to perform searches (not directly; requests will be proxied and transformed).

The crawler will periodically scan the disk for new, updated, and deleted files, and pass them on to the indexer queue. The indexer will do 4 things:

Caption the image, by talking to a persistent Python child process via Inter Process Communication (IPC) - captions will be written as EXIF data to images
Thumbnail images and store them in a cache (perhaps some kinda key-value store, as lots of little files on disk would be a disaster for disk space)
Extract EXIF (and other) metadata
Finally, push the metadata to meilisearch for indexing

Tasks 2 and 3 can be done in parallel, but the others will need to be done serially - though multiple images can of course be processed concurrently. I anticipate much asynchronous code here, which I'm rather looking forward to finishing writing :D
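
The ordering described above could be sketched like this in Python with asyncio. This is only an illustration of the control flow: all 4 step functions are stand-ins that do no real work, their names are made up, and the real indexer need not be written in Python at all.

```python
import asyncio

async def caption_image(path):
    await asyncio.sleep(0)  # stand-in for talking to the captioning process
    return f"caption for {path}"

async def make_thumbnail(path):
    await asyncio.sleep(0)  # stand-in for resizing and caching
    return f"thumbnail of {path}"

async def extract_metadata(path):
    await asyncio.sleep(0)  # stand-in for reading EXIF data
    return {"path": path}

async def push_to_index(document):
    await asyncio.sleep(0)  # stand-in for a request to the search backend
    return document

async def index_image(path):
    # Task 1 runs first, because the caption is written into the image's EXIF data
    caption = await caption_image(path)
    # Tasks 2 and 3 are independent of each other, so run them concurrently
    thumbnail, metadata = await asyncio.gather(
        make_thumbnail(path), extract_metadata(path)
    )
    # Task 4 needs the results of everything above
    return await push_to_index(
        {**metadata, "caption": caption, "thumbnail": thumbnail}
    )

async def index_all(paths):
    # Several images can be in flight at once
    return await asyncio.gather(*(index_image(path) for path in paths))

results = asyncio.run(index_all(["a.jpg", "b.jpg"]))
print(results)
```

asyncio.gather returns results in the order the tasks were passed in, so the output lines up with the list of paths even though the work overlaps.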

I already have a good start on the foundation of the search engine here. Once I've implemented enough that it's functional, I'll open source everything.

To finish this post, I have a mockup screenshot of what the main search page might look like:

Obviously the images are all placeholders (append ?help to this URL to see the help page) for now and I don't yet have a name for it (suggestions in the comments are most welcome!), but the rough idea is there.

Configuring an endlessh honeypot with rsyslog email notifications

Security is all about defence in depth, so I'm always looking for ways to better secure my home network. For example, I have cluster management traffic running over a Wireguard mesh VPN. Now, I'm turning my attention to the rest of my network.

To this end, while I have a guest network with wireless isolation enabled, I do not currently have a way to detect unauthorised devices connecting to my home WiFi network, or fake WiFi networks with the same name, etc. Detecting this is my next focus. While I've seen nzyme recently and it looks fantastic, it also looks more complicated to set up.

While I look into the documentation for nzyme, inspired by this reddit post I decided to set up a honeypot on my home network.

The goal of a honeypot is to detect threats moving around in a network. In my case, I want to detect if someone has connected to my network who shouldn't have done. Honeypots achieve this by pretending to be a popular service, but in reality they are there to collect information about potential threats.

To set one up, I found endlessh, which pretends to be an SSH server - but instead slowly sends an endless banner to the client, keeping the connection open as long as possible. It can also log connection attempts to syslog, which allows us to detect connections and send an alert.

Implementing this comes in 2 steps. First, we set up endlessh and configure it to log connection attempts. Then, we reconfigure rsyslog to send email alerts.

Setting up endlessh

I'm working on one of the Raspberry Pis running Raspberry Pi OS in my network, but this should work with other machines too.

If you're following along to implement this yourself, make sure you've moved SSH to another port number before you continue, as we'll be configuring endlessh to listen on port 22. This is the default port for ssh, so I imagine it's the port that an automated network scanner would attempt to connect to by default if it were looking for ssh servers to attempt to crack.

Conveniently, endlessh has a package in the default Debian repositories:

sudo apt install endlessh

...adjust this for your own package manager if you aren't on an apt-based system.

endlessh has a configuration file at /etc/endlessh/config by default. Open it up for editing, and make it look something like this:

# The port on which to listen for new SSH connections.
Port 22

# Set the detail level for the log.
#   0 = Quiet
#   1 = Standard, useful log messages
#   2 = Very noisy debugging information
LogLevel 1

Before we can start the endlessh service, we need to reconfigure it to allow it to listen on port 22, as this is a privileged port number. Doing this requires 2 steps. First, allow the binary to listen on privileged ports:

sudo setcap CAP_NET_BIND_SERVICE=+eip "$(which "endlessh")";

Then, if you are running systemd (most distributions do by default), execute the following command:

sudo systemctl edit endlessh.service

This will allow you to append some additional directives to the service definition for endlessh, without editing the original apt-managed systemd service file. Add the following, and then save and quit:

[Service]
AmbientCapabilities=CAP_NET_BIND_SERVICE
PrivateUsers=false

Finally, we can restart the endlessh service:

sudo systemctl restart endlessh
sudo systemctl enable --now endlessh

That completes the setup of endlessh!

Configuring rsyslog to send email alerts

The second part of this process is to send automatic alerts whenever anyone connects to our endlessh service. Since endlessh forwards logs to syslog by default, reconfiguring rsyslog to send the alerts seems like the logical choice. In my case, I'm going to send email alerts - but other ways of sending alerts do exist - I just haven't looked into them yet.

To do this requires that you have either a working email server (I followed the Ars Technica taking email back series, but whatever you do it's not for the faint of heart! Command line experience is definitely required - if you're looking for a nice first project to try, try a web server instead), or an email account you can use. Note that I do not recommend using your own personal email account, as you'll have to store the password in plain text!

In my case, I have my own email server, and I have forwarded port 25 down an SSH tunnel so that I can use it to send emails (in the future I want to configure a proper smart host that listens on port 25 and forwards emails by authenticating against my server properly, but that's for another time as I have yet to find a relay-only MTA that also listens on port 25).

In a previous post, I implemented centralised logging - so I'm going to be reconfiguring my main centralised rsyslog instance.

To do this, open up /etc/rsyslog.d/10-endlessh.conf for editing, and paste in something like this:

template (name="mailSubjectEndlessh" type="string" string="[HONEYPOT] endlessh connection on %hostname%")

if ( ($programname == 'endlessh') and (($msg contains "ACCEPT") or ($msg contains "CLOSE")) ) then {
    action(type="ommail" server="localhost" port="20205"
        mailfrom="sender@example.com"
        mailto=["bill@billsboosters.net"]
        subject.template="mailSubjectEndlessh"
        action.execonlyonceeveryinterval="3600"
    )
}

...where:

[HONEYPOT] endlessh connection on %hostname% is the subject name, and %hostname% is substituted for the actual hostname the honeypot is running on
sender@example.com is the address that you want to send the alert FROM
bill@billsboosters.net is the address that you want to send the alert TO
3600 is the minimum interval between emails, in seconds. Log lines are not collected up - only 1 log line is sent at a time, and others logged in-between are ignored and handled as if the above email directive doesn't exist until the given number of seconds expires - at which point it will then email for the next log line that comes through, and the cycle then repeats. If anyone knows how to change that, please leave a comment below.

Note that the template line is outside the if statement. This is important - I got a syntax error if I put it inside the if statement.

The if statement specifically looks for log messages with a tag of endlessh that contain either the substring ACCEPT or CLOSE. Only if those conditions are true will it send an email.
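
If it helps to see that logic written out, here's a small Python model of the filter and the once-per-interval behaviour described above. It models the behaviour as I've described it and is not how rsyslog is implemented; the function name should_email and the sample log lines (including their exact format) are made up for illustration.

```python
def should_email(program, message, now, last_sent, interval=3600):
    """Return (send, new_last_sent) for one log line.

    The program name must be endlessh, the message must contain ACCEPT or
    CLOSE, and at most one email is sent per interval. Lines that arrive
    inside the interval are dropped, not queued up.
    """
    matches = program == "endlessh" and ("ACCEPT" in message or "CLOSE" in message)
    if not matches:
        return False, last_sent
    if last_sent is not None and now - last_sent < interval:
        return False, last_sent
    return True, now

# (seconds since start, program name, message) - illustrative only
log = [
    (0, "endlessh", "ACCEPT host=192.0.2.10 port=51234"),
    (10, "endlessh", "CLOSE host=192.0.2.10 port=51234"),
    (20, "sshd", "ACCEPT something unrelated"),
    (4000, "endlessh", "ACCEPT host=192.0.2.11 port=40000"),
]
last_sent = None
for timestamp, program, message in log:
    send, last_sent = should_email(program, message, timestamp, last_sent)
    print(timestamp, program, send)
```

Only the lines at 0 and 4000 seconds trigger an email here: the CLOSE at 10 seconds falls inside the hour-long interval, and the sshd line fails the program name check.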

I have yet to learn how to configure rsyslog to authenticate while sending emails. I would suspect though that the easiest way of achieving this is to set up a local SMTP relay-only MTA (Mail Transfer Agent) that rsyslog can connect to and send emails, and then the relay will authenticate against the real server and send the email on rsyslog's behalf. I have yet to find such an MTA however other than Postfix - which, while great, can be hugely complicated to set up. Other alternatives I've tried include:

nullmailer
msmtp
dma - The Dragonfly Mail Agent

....but they all implement sendmail and while that's useful they do not listen on port 25 (or any other port for that matter) as far as I can tell.

Anyway, the other file you need to edit is /etc/rsyslog.conf. Open it up for editing, and put this near the top:

module(load="ommail")

...this loads the mail output plugin that sends the emails.

Now that we've reconfigured rsyslog, we need to restart it:

sudo systemctl restart rsyslog

rsyslog is picky about its config file syntax, so make sure to check its status for error messages:

sudo systemctl status rsyslog

You can also use lnav to analyse your logs and find any error messages there too.

Conclusion

We've set up endlessh as a honeypot, and then reconfigured rsyslog to send email alerts. Test the system like so on your local machine:

ssh -vvv -p 22 someuser@yourserver

...and watch your inbox for the email alert that will follow shortly!

While this system isn't particularly useful on its own, it's a small part of a larger strategy for securing my network. It's also been a testing ground for me to configure rsyslog to send email alerts - something I may want to configure my centralised rsyslog logging system to do for other things in the future.

If you've found this post useful or you have some suggestions, please leave a comment below!
