Starbeamrainbowlabs


Portable Python on Windows

I have been asked a number of times how to run a custom version of Python on a Windows machine without administrative access over a given machine (despite the fact that I use Linux and not Windows :P). I'm sure I've written a guide before, but I can't find it so I thought I'd write another and document it here for reference.

Essentially, the solution is as follows:

  1. Download a compressed archive of 'embeddable Python' for the version and CPU architecture of Python you need
  2. Delete python*._pth (where * is your Python version)
  3. Update PATH
  4. Run your Python program :D

Let's run through these steps in detail.

Downloading Python

To download Python, head to this address: https://www.python.org/downloads/windows/

Then, click on the following:

  1. "Latest Python 3 Release - Python 3.XX.Y" under "Python Releases for Windows", where XX.Y are digits. For example: "Latest Python 3 Release - Python 3.11.4"
  2. Scroll down to the heading "Files"
  3. Then, download the file entitled "Windows embeddable package (64-bit)", or whichever one suits your CPU architecture

Extract the contents of the resulting .zip to a folder.

Deleting the _pth file

Before we can do anything, we need to delete the ._pth file in the extracted folder. If you do not do this, you will find Python is unable to locate any modules.

Delete the file ending in ._pth. The exact name changes depending on the Python release you have downloaded, but in general it will be in the form python3XX._pth, where XX will be the same digits from above.
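If you'd rather script this step than hunt for the file by hand, a short Python sketch like the following would do it (the folder path and exact filename vary with your download, so treat this as illustrative):

```python
from pathlib import Path

def delete_pth(python_dir):
    """Delete any python3XX._pth file from an extracted embeddable
    Python folder so that module lookup works normally."""
    removed = []
    for pth in Path(python_dir).glob("python3*._pth"):
        pth.unlink()
        removed.append(pth.name)
    return removed
```

For example, delete_pth(r"C:\path\to\python") would remove python311._pth from a Python 3.11 download.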

Updating PATH

Now that Python is downloaded and prepared, we need to update the PATH environment variable.

When you type a command into the command prompt (e.g. python), the command line interpreter will search all of the directories listed in PATH (semicolon separated on Windows, colon separated on Linux and macOS) to find that executable. This is also used by Python to locate modules on Windows.

Adapt the following command to your specific situation.

set PATH=C:\path\to\python;C:\path\to\python\Scripts;%PATH%

Caution: The path you choose MUST NOT contain any spaces.

Whenever you want to use the portable version of Python you've downloaded, you will need to execute your version of this command first.

Note that in some Jupyter Notebook environments the environment variable PATH is reset, so you will need to adjust the environment variable PATH in Jupyter.
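For example, inside a notebook cell you could prepend your portable Python's folders to PATH for the current process (the paths here are placeholders - adjust them to wherever you extracted embeddable Python):

```python
import os

# Hypothetical location of your extracted portable Python - adjust as needed.
portable = r"C:\path\to\python"

# Prepend both the interpreter folder and its Scripts folder to PATH for
# this process only; child processes launched from here will inherit it.
os.environ["PATH"] = os.pathsep.join([
    portable,
    os.path.join(portable, "Scripts"),
    os.environ.get("PATH", ""),
])
```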

Installing pip

Installing dependencies your code needs via pip is a common task, but with this embeddable/portable Python setup it requires a little bit more work. Execute the following command:

curl -sSLO https://bootstrap.pypa.io/get-pip.py
python get-pip.py

If you don't have curl installed, then download get-pip.py in your browser instead of running the curl command.

Once it completes, you should be able to use the pip command as normal:

pip install tensorflow seaborn pandas numpy

Conclusion

We've downloaded embeddable Python and set it up for portable use on Windows. Please remember that if you are on a shared computer you should be mindful of disk space usage and put your copy of portable Python on a USB flash drive, or otherwise delete it when you're done. This is especially important if you install big packages like tensorflow, which can be 1GiB+!

Another thing to keep in mind is keeping your portable Python installation up to date. Add a reminder in your calendar to check the Python website I linked to above regularly to ensure you get security updates.

Alternative methods for those with admin access include package managers such as Chocolatey and Scoop.

This concludes this guide. If you're looking for a new operating system, I can recommend Linux :D

PhD Update 16: Realising the possibilities of the past

Hey there! It's been a while. As I explained in a previous post, I've been adjusting to a new part-time position in my department! In short, posts will continue on here but may be slightly less frequent than before.

Before we begin, here is the customary list of previous posts in this series:

A lot has happened since last time! I'm going to split this up into sections as I have in previous posts, but in summary the noodling around I've done since the last post has really paid off.

Publication

After another round of revisions, my journal article on my social media research has now been accepted and published!

View it here:

https://doi.org/10.1016/j.cageo.2023.105405

It's my first published journal article, so I am quite excited about it :D

If I haven't already (I'm writing this post first, as I only got notified about its publication while I was writing this post!), I'll definitely be making another post about it here!

Rainfall Radar

The main thing I've focused on a lot is my rainfall radar model, and beating it into some kind of shape that actually works. This has not been a simple process, but I think that this graph speaks for itself:

It works! I can scarcely believe that after nearly 3 and a half years it finally produces a useful output. This feels like a big personal achievement - as those who have been following this series will know, I have tried many different things before reaching this point.

The first question I know will be on your mind is "How did we get here?", and the answer to this lies in 2 things:

  1. Connectedness
  2. Resolution and boundary difficulties

Let's tackle connectedness first. By connectedness, I mean specifically parts of a given model connecting back on themselves. This is important for multiple reasons, not least because it reduces the effect of the vanishing gradient problem. This is also the reason that ResNet adds skip-connections, as then the gradient weight updates used when the model is backpropagating can flow all the way up to the top of the model. Without this, the weights in the initial layers can't update, and information is then lost before it even makes it very far into the model.

From what I can tell, this is the primary problem with the models that I have tried so far, in one way or another. Autoencoders, for example, do not have very much connectedness, making it difficult for backpropagation to do its thing.... especially when the task at hand is a significantly difficult one.

The solution I ended up employing here was DeepLabV3+. It uses an image encoder at first, and then a PSPNet-style pyramid scheme for analysing multiple scales of features, and then finally an image segmentation head. It also has a skip connection between halfway up the image encoder and the segmentation head too, further increasing the connectedness of the model.

Once I had something from the initial DeepLabV3+ model, improving the output was a matter of increasing the resolution of the output and adjusting things so that the model can better resolve the boundaries of the water / no water regions I was asking it to predict.

To this end, with the tweaking I describe below, DeepLabV3+ has turned out to be the ideal model for the task. The changes and hyperparameters I used can be summarised like so:

  • Loss function: Add dice loss to cross-entropy loss.
  • Learning rate: Reduce to 0.00001 from the default 0.001
  • Upscaling: Hack the model to upscale the input/outputs. This increases the resolution the model operates at, improving performance significantly
  • Removing isolated pixels: I removed water pixels with no neighbour pixels being water. This is like a band-aid on the real problem (a bad physics-based model run), but it did help.
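To give a feel for the first bullet point, here's a minimal pure-Python sketch of a soft dice loss added to binary cross-entropy. This is illustrative only - it is not the actual code from my model, which operates on tensors rather than flat lists:

```python
import math

def dice_loss(y_true, y_pred, eps=1e-7):
    """Soft dice loss over flat lists of probabilities: ~0 when the
    prediction matches the target exactly, approaching 1 when disjoint."""
    inter = sum(t * p for t, p in zip(y_true, y_pred))
    total = sum(y_true) + sum(y_pred)
    return 1 - (2 * inter + eps) / (total + eps)

def combined_loss(y_true, y_pred):
    """Binary cross-entropy plus dice, per the loss function bullet above."""
    ce = -sum(
        t * math.log(max(p, 1e-7)) + (1 - t) * math.log(max(1 - p, 1e-7))
        for t, p in zip(y_true, y_pred)
    ) / len(y_true)
    return ce + dice_loss(y_true, y_pred)
```

The dice term pushes the model to get the overlap of the predicted water region right, which complements cross-entropy's per-pixel view.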

With this working model, I can now reasonably consider the other avenues I was exploring in part 15 of this series as dead ends, though I have learned a lot by investigating them.

I consider this model to be a proof of concept only. The idea needs a lot more adjusting and improving before it will be actually useful to anyone. Still, it might improve short-to-medium term (i.e. ~up to a few hours in advance) flood forecasts with a lot more work, I think. While my focus currently is on writing up my thesis (see below), I do plan on continuing to work on this and other research projects on the side on a long-term basis. One of my many goals is to wrangle this model into something more than just a proof of concept, and somehow measure its effectiveness more precisely.

The long road to thesis and beyond

With my funding for the main research period of my PhD at an end, my focus has been shifting to the writing of my thesis. The feeling is actually quite surreal - for the longest time writing my thesis has been a mystical objective far off in the distance in a blurry haze, but now the details are very much resolving into something more tangible.

So far I have a draft chapter based on my recent journal article, and part of a chapter on the rainfall radar model I've talked about briefly above. I also have parts of the introduction and background sections, but these require significant reworking because I wrote them ages ago (they are just bad).

The plan is to have my thesis complete by December 2023, potentially giving the required 3 months submission notice in ~September 2023 - depending on how things go with writing.

When my PhD comes to a close, that will also mean the end of this series of blog posts. I think this is the longest series of blog posts I've ever posted here, and certainly one of the most personal. This does not mean the end of posts on here about my research though, as I plan to continue blogging about it on here. The form this will take is likely to be similar to the form that the posts for my PhD have taken.

As I don't currently know what form my research will take after my PhD, I cannot say what will happen about blogging about it, only that it will happen ;-)

Thanks for sticking with me throughout this long and at times difficult process - it's been a wonderful and wild ride! Even as this part of my journey is beginning to come to a close, I really appreciate all the help and support everyone has given me throughout the process.

I'll try my best to keep up with this series again, now that I've had some time to adjust to being an Experimental Officer. Until next time!

The journal article about my social media research is out now!

This is just a quick little post to announce that I have published my first journal article! This has been a significantly long time in the making, with the review process and all associated corrections alone taking from October 2022 until a week or two ago.

It has been published in the Elsevier journal Computers and Geosciences, with the following title:

Real-time social media sentiment analysis for rapid impact assessment of floods

The article is open access, so everyone should be able to read it. I must thank everyone who has helped and contributed to the process of putting this journal article together - their names are on the journal article.

Hopefully this is the first of many!

Syntax highlighting and word wrapping code in LaTeX

For the longest time, I didn't think it was possible to syntax highlight code in LaTeX, but I was proven wrong a few months ago when I stumbled across the minted package, which uses pygments under-the-hood. Even better, it ships with texlive by default!

Recently, I had need to put it to use once more, and I encountered a bug in which my source code was too long for the line, but minted did not wrap it on to the following line. The solution was quite simple but definitely non-obvious, so I thought I'd make a quick post about it here.

To start with, minted works a lot like the lstlisting you might have used before:

An example block of syntax highlighted Javascript:

\begin{minted}{javascript}
"use strict";

export default async function(x, y) {
    return x + y;
}
\end{minted}

Pretty easy, right? But if you have a really long line (or lots of indentation), then it might overflow the edge of the page without wrapping. Thankfully there's an easy way to fix it that I discovered after digging around for a bit:

\begin{minted}[breaklines,breakanywhere]{javascript}
"use strict";

export default async function(x, y) {
    return x + y;
}
\end{minted}

....by adding [breaklines,breakanywhere] directly after \begin{minted}, we can get it to wrap onto the next line! Even better, we can use the same trick to also add line numbers for easy referencing later:

\begin{minted}[breaklines,breakanywhere,linenos]{javascript}
"use strict";

export default async function(x, y) {
    return x + y;
}
\end{minted}

The linenos option here causes minted to draw line numbers before each line of code. This also respects wrapped lines too, so they don't get all out of sync. Here's a sample of all of these tricks put together in action:

A screenshot of some Python code, syntax highlighted with minted. Some lines wrap onto multiple physical lines on the page. The code is part of a function that preprocesses text for GloVe.

I hope this helps someone out, because I know I find it very useful. I'll hopefully be posting another PhD update blog post soon for those who are eagerly awaiting it - I know it's overdue!

Post a comment below if you have anything specific you'd like me to cover, and I'll do my best to make a post about it.

Achievement get: Experimental Officer position!

Hey there, everyone! It's been a while since I last posted here. Rest assured that I haven't abandoned this blog.

What I have been doing is adjusting to a new job! I haven't talked about it yet because this adjustment process takes time, and I needed space to do this process at my own pace. While I'm still adjusting and will be for some time, I'm now at a point where I feel comfortable to share with everyone what I've been up to.

My job title is Experimental Officer in the Department of Computer Science at the University of Hull - the same place as I've been doing my PhD (as is listed on the homepage of this site). I started this in January 2023.

This is an academic position, and it consists of academic teaching support (ie supporting lecturers in the delivery of their course content) combined with systems administration and managing the use of specialist equipment.

This role feels ideal for me at this time, as it's a mix of academicy teaching support stuff and a technical role. It also has some teaching on the side, which is something I have very limited experience with, so it's a great chance to learn.

I'm doing this role part time at the moment while I finish my PhD.

In practical effects for this blog, it means that my posting frequency will be somewhat lower than it has been in times past, as you may have been noticing in the months leading up to my three month break. I'm going to personally aim for a blog post every 2 weeks, but it might be longer or shorter than that depending on how much energy I have to write posts and whether there's something I really want to talk about without waiting.

Given that I've been posting here since June 2014 (~9 years!), this blog is important to me, and so I thought it would be fitting to let you who read this blog know first about this news. I'll also continue to document my journey through the world of computer science into the future.

This will include a continuing to blog about my research post-PhD (no, I still haven't had enough of it :P) in some sort of sequel series to my PhD update blog posts that I've been writing. It will also include the random blog posts you've surely come to expect from me about neat things I've discovered and interesting things I've done.

Another milestone I am about to hit soon is my first published journal article! It has been accepted for publication, so I'm currently working through that process. The title will be "Real-time social media sentiment analysis for rapid impact assessment of floods", and I will definitely be posting here as soon as it's published.

I'd like to thank everyone who has supported me in this journey so far up until this point. I really appreciate it!

I hope you'll continue to stick around here with me as I move forwards into this new era!

Visualising Tensorflow model summaries

It's no secret that my PhD is based in machine learning / AI (my specific title is "Using Big Data and AI to dynamically map flood risk"). Recently a problem I have been plagued with is quickly understanding the architecture of new (and old) models I've come across at a high level. I could read the paper a model comes from in detail (and I do this regularly), but it's much less complicated and much easier to understand if I can visualise it in a flowchart.

To remedy this, I've written a neat little tool that does just this. When you're training a new AI model, one of the things that it's common to print out is the summary of the model (I make it a habit to always print this out for later inspection, comparison, and debugging), like this:

model = setup_model()
model.summary()

This might print something like this:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 500, 16)           160000    

lstm (LSTM)                  (None, 32)                6272      

dense (Dense)                (None, 1)                 33        
=================================================================
Total params: 166,305
Trainable params: 166,305
Non-trainable params: 0
_________________________________________________________________

(Source: ChatGPT)

This is just a simple model, but it is common for larger ones to have hundreds of layers, like this one I'm currently playing with as part of my research:

(Can't see the above? Try a direct link.)

Woah, that's some model! It must be really complicated. How are we supposed to make sense of it?

If you look closely, you'll notice that it has a Connected to column, as it's not a linear tf.keras.Sequential() model. We can use that to plot a flowchart!
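For illustration, pulling edges out of that column might look something like this rough sketch. To be clear, this is not the actual parser my tool uses - it just conveys the idea, and assumes "Connected to" entries of the form layername[0][0]:

```python
import re

def parse_edges(summary_lines):
    """Rough sketch: extract (source, target) layer edges from the lines of
    a functional-model summary, using the bracketed entries in the
    "Connected to" column (illustrative only)."""
    edges = []
    for line in summary_lines:
        # A layer row starts with "name (Type)"; skip separator rows.
        m = re.match(r"^\s*(\w+)\s+\([\w.]+\)", line)
        if not m:
            continue
        target = m.group(1)
        # Entries like conv1[0][0] name the layers feeding into this one.
        for src in re.findall(r"(\w+)\[\d+\]\[\d+\]", line):
            edges.append((src, target))
    return edges
```

From a list of such edges, a graphing library like nomnoml can then lay out the flowchart.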

This is what the tool I've written generates, using a graphing library called nomnoml. It parses the Tensorflow summary, and then compiles it into a flowchart. Here's a sample:

It's a purely web-based tool - no data leaves your client. Try it out for yourself here:

https://starbeamrainbowlabs.com/labs/tfsummaryvis/

For the curious, the source code for this tool is available here:

https://git.starbeamrainbowlabs.com/sbrl/tfsummaryvis

NAS Series List

Somehow, despite posting about my NAS back in 2021 I have yet to post a proper series list post about it! I'm rectifying that now with this quick post.

I wrote this series of 4 posts back when I first built my new NAS box.

Here's the full list of posts in the main NAS series:

Additionally, as a bonus, later in 2021 I wrote a pair of posts about how I was backing up my NAS to a backup NAS. Here they are:

How (not) to recover a consul cluster

Hello again! I'm still getting used to a new part-time position at University which I'm not quite ready to talk about yet, but in the mean time please bear with me as I shuffle my schedule around.

As I've explained previously on here, I have a consul cluster (superglue service discovery!) that forms the backbone of my infrastructure at home. Recently, I had a small powercut that knocked everything offline, and as the recovery process was quite interesting I thought I'd blog about it here.

The issue happened at about 5pm, but I didn't discover it was a problem until a few hours later when I got home. Essentially, a small powercut knocked everything offline. While my NAS rebooted automatically afterwards, my collection of Raspberry Pis weren't so lucky. I can only suspect that they were caught in some transient state or something. None of them responded when I pinged them, and later inspection of the logs on my collectd instance revealed that they were essentially non-functional until after they were rebooted manually.

A side effect of this was that my Consul cluster (and, by extension, my Nomad cluster) was knocked offline.

Anyway, at first I only rebooted the controller host (that has both a Consul and Nomad server running on it, but does not accept and run jobs). This rebooted just fine and came back online, so I then rebooted my monitoring box (that also runs continuous integration), which also came back online.

Due to the significantly awkward physical location I keep my cluster in with the rest of the Pis, I decided to flip the power switch on the extension to restart all my hosts at the same time.

While this worked..... it also caused my cluster controller node to reboot, which caused its raft epoch number to increment by 1... which broke the quorum (agreement) of my cluster, and required manual intervention to resolve.

Raft quorum

To understand the specific issue here, we need to look at the Raft consensus algorithm. Raft is, as the name suggests, a consensus algorithm. Such an algorithm is useful when you have a cluster of servers that need to work together in a redundant fault-tolerant fashion on some common task, such as in our case Consul (service discovery) and Nomad (task scheduling).

The purpose of a raft server is to maintain agreement amongst all nodes in a cluster as to the global state of an application. It does this using a distributed log that it replicates through a fancy but surprisingly simple algorithm.

At the core of this algorithm is the concept of a leader. The cluster leader is responsible for managing and committing updates to the global state, as well as sending out the global state to everyone else in the cluster. In the case of Consul, the Consul servers are the cluster (the clients simply connect back to whichever servers are available) - and I have 3 of them, since Raft requires an odd number of nodes.

When the cluster first starts up or the leader develops a fault (e.g. someone sets off a fork bomb on it just for giggles), an election occurs to decide on a new leader. The election term number (or epoch number) is incremented by one, and everyone votes on who the new leader should be. The node with the most votes becomes the new leader, and quorum (agreement) is achieved across the entire cluster.
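To make the term-number mechanics concrete, here's a toy sketch (emphatically not real Raft - the real protocol involves much more): a vote cast in a stale or drifted term doesn't count, and a candidate needs a strict majority of the cluster to win.

```python
from collections import Counter

def elect(votes, cluster_size, current_term):
    """Toy election: votes maps voter -> (term, candidate). Votes cast in
    any other term are discarded; the winner needs a strict majority."""
    tally = Counter(c for t, c in votes.values() if t == current_term)
    if not tally:
        return None  # nobody voted in this term: no leader
    candidate, count = tally.most_common(1)[0]
    return candidate if count > cluster_size // 2 else None
```

With 3 servers, two matching votes elect a leader - but if one server's term number has drifted, its vote is discarded and the election can fail, which is essentially what happened to my cluster.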

Consul and Raft

In the case of Consul, everyone must cast a vote for the vote to be considered valid, otherwise the vote is considered invalid and the election process must begin again. Crucially, the election term number must also be the same across everyone voting.

In my case, because I started my cluster controller and then rebooted it before it had a chance to achieve quorum, it incremented its election term number an additional time compared to the rest of the cluster, which caused the cluster to fail to reach quorum: the other 2 nodes in the Consul server cluster considered the controller node's vote to be invalid, yet they still demanded that all servers vote to elect a new leader.

The practical effect of this was that, because the Consul cluster failed to agree on who the leader should be, the Nomad cluster (which hangs off the Consul cluster, using it to find each other) also failed to start and subsequently reach quorum, which knocked all my jobs offline.

The solution

Thankfully, the Hashicorp Consul documentation for this specific issue is fabulous:

https://developer.hashicorp.com/consul/tutorials/datacenter-operations/recovery-outage#failure-of-a-server-in-a-multi-server-cluster

To summarise:

  1. Boot the cluster as normal if it isn't booted already
  2. Stop the failed node
  3. Create a special config file (raft/peers.json) that will cause the failed node to drop its state and accept the state of the incomplete cluster, allowing it to rejoin and the cluster to regain quorum once more.

The documentation to perform this recovery protocol is quite clear. While there is an option to recover a failed node if you still have a working cluster with a leader, in my case I didn't so I had to use the alternate route.
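For reference, the peers.json file is a simple JSON array describing the servers that should form the cluster. It looks something along these lines - the IDs and addresses here are placeholders, and the exact fields depend on your Raft protocol version, so check the linked documentation before using it:

```json
[
  {
    "id": "3c4a5f2e-0000-0000-0000-000000000001",
    "address": "10.0.0.10:8300",
    "non_voter": false
  },
  {
    "id": "3c4a5f2e-0000-0000-0000-000000000002",
    "address": "10.0.0.11:8300",
    "non_voter": false
  }
]
```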

Conclusion

I've talked briefly about an interesting issue that caused my Consul cluster to break quorum, which inadvertently brought my entire infrastructure down until I resolved the issue.

While Consul is normally really quite resilient, you can break it if you aren't careful. Having an understanding of the underlying consensus algorithm Raft is very helpful to diagnosing and resolving issues, though the error messages and documentation I looked through were generally clear and helpful.

Considerations on monitoring infrastructure

I like Raspberry Pis. I like them so much that by my count I have at least 8 in operation at the time of typing performing various functions for me, including a cluster for running various services.

Having Raspberry Pis and running services on servers is great, but once you have some infrastructure set up hosting something you care about, your thoughts naturally turn to mechanisms by which you can ensure that such infrastructure continues to run without incident, and that if problems do occur they can be diagnosed and fixed efficiently.

Such is the thought that is always on my mind when managing my own infrastructure, which sprawls across multiple physical locations. To this end, I thought I'd blog about what my monitoring system looks like - what its strengths are, and what it could do better.

A note before we begin: I continue to have a long-term commitment to posting on this blog - I have just started a part-time position alongside my PhD due to the end of my primary research period, which has been taking up a lot of my mental energy. Things should get slowly back to normal soon-ish.

Keep in mind as you read this that my situation may be different to your own. For example, monitoring a network primarily consisting of Raspberry Pis demands a very different approach than an enterprise setup (if you're looking for a monitoring solution for a bunch of big powerful servers, I've heard the TICK stack is a good place to start).

Monitoring takes many forms and purposes. Broadly speaking, I split the monitoring I have on my infrastructure into the following categories:

  1. Logs (see my earlier post on Centralising logs with rsyslog)
  2. System resources (e.g. CPU/RAM/disk/etc usage) - I use collectd for this
  3. Service health - I use Consul for my cluster, and Uptime Robot for this website.
  4. Server health (e.g. whether a server is down or not, hanging due to a bad mount, etc.)

I've found that as there are multiple categories of things that need monitoring, there isn't a single one-size-fits-all solution to the problem, so different tools are needed to monitor different things.

Logs - centralised rsyslog

At the moment, monitoring logs is a solved problem for me. I've talked about my setup previously, in which I have a centralised rsyslog server which receives and stores all logs from my entire infrastructure (barring a few select boxes I still need to enrol in this system). Storing logs nets me 2 things:

  1. The ability to reference them (e.g. with lnav) later in the event of an issue for diagnostic purposes
  2. The ability to inspect the logs during routine maintenance for any anomalies, issues, or errors that might become problematic later if left unattended

System information - collectd

Similarly, storing information about system resource usage - such as CPU load or disk usage for instance - is more useful than you'd think for spotting and pinpointing issues with one's infrastructure - be it a single server or an entire fleet. In my case, this also includes monitoring network latency (useful should my ISP encounter issues, as then I can identify if it's a me or a them problem) and HTTP response times.

For this, I use collectd, backed by rrd (round-robin database) files. These are fixed-size files that contain ring buffers that it iteratively writes over, allowing efficient storage of up to 1 year's worth of history.
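The ring-buffer idea behind rrd files can be illustrated in a few lines of Python:

```python
from collections import deque

# A fixed-size ring buffer: once full, appending a new sample silently
# drops the oldest one, so storage never grows - just as an rrd file
# keeps a bounded window of history.
history = deque(maxlen=5)
for sample in range(8):
    history.append(sample)

# Only the 5 most recent samples survive.
print(list(history))  # → [3, 4, 5, 6, 7]
```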

To visualise this in the browser, I use Collectd Graph Panel, which is unfortunately pretty much abandonware (I haven't found anything better).

To start with the strengths of this system, it's very computationally efficient. I have tried previously to setup a TICK (Telegraf, InfluxDB, Chronograf, and Kapacitor) stack on a Raspberry Pi, but it was way too heavy - especially considering the Raspberry Pi my monitoring system runs on is also my continuous integration server. Collectd, on the other hand, runs quietly in the background, barely using any resources at all.

Another strength is that it's easy and simple. You throw a config file at it (which could be easily standardised across an entire fleet of servers), and collectd will dutifully send encrypted system metrics to a given destination for you with minimal fuss. Meanwhile, the browser-based dashboard I use automatically plots graphs and displays them for you without any tedious creation of a custom dashboard.

Having a system monitor things is good, but having it notify you in the event of an anomaly is even better. While collectd does have the ability to generate and send notifications, its capacity to do this is unfortunately rather limited.

Another limitation of collectd is that accessing and processing the stored system metrics data is not a trivial process, since it's stored in rrd databases, the parsing of which is surprisingly difficult due to a lack of readily available libraries to do this. This makes it difficult to integrate it with other systems, such as n8n for example, which I have recently setup to replace some functions of IFTTT to automatically repost my blog posts here to Reddit and Discord.

Collectd can write to multiple sources however (e.g. MQTT), so I might look into this as an option to connect it to some other program to deliver more flexible notifications about issues.

Service health

Service health is what most people might think of when I initially said that this blog post would be about monitoring. In many ways, it's one of the most important things to monitor - especially if other people rely on infrastructure which is managed by you.

Currently, I achieve this in 2 ways. Firstly, for services running on the server that hosts this website I have a free Uptime Robot account which monitors my server and website. It costs me nothing, and I get monitoring of my server from a completely separate location. In the event that my server or the services it monitors go down, I will get an email telling me as such - and another email once everything comes back up again.

Secondly, for services running on my cluster I use Consul's inbuilt service monitoring functionality, though I don't yet have automated emails to notify me of failures (something I need to investigate a solution for).
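Consul does expose failing checks via its HTTP API (`/v1/health/state/critical`), so one possible solution is a small script that polls that endpoint and emails me the results. Here's a sketch - the agent address is an assumption, and `summarise` just formats the JSON Consul returns:

```python
import json
import urllib.request

CONSUL = "http://127.0.0.1:8500"  # assumed: a local Consul agent

def failing_checks(consul_base: str = CONSUL):
    """Fetch all health checks currently in the 'critical' state."""
    url = f"{consul_base}/v1/health/state/critical"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

def summarise(checks):
    """Format each failing check as a single human-readable line, ready to email."""
    return [
        f"{c.get('Node', '?')}: {c.get('ServiceName') or c.get('CheckID', '?')}"
        f" -> {c.get('Output', '').strip()}"
        for c in checks
    ]
```

Wiring `summarise(failing_checks())` up to a mail command in a cron job would close the notification gap until a proper solution is in place.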

The monitoring system you choose here depends on your situation, but I strongly recommend having at least some form of external monitoring of whether your target boxes go down that can notify you of this. If your monitoring is hosted on the box that goes down, it's not really of much use...!

Monitoring service health more robustly and notifying myself about issues is currently on my todo list.

Server health

Server health ties into service health, and perhaps also system information too. Knowing which servers are up and which ones are down is important - not least because of the services running thereon.

The tricky part of this is that if a server goes down, it could be because of any one of a number of issues - ranging from a simple software/hardware failure, all the way up to an entire-building failure (e.g. a powercut) or a natural disaster. With this in mind, it's important to plan your monitoring carefully such that you still get notified in the event of a failure.

Conclusion

In this post, I've talked a bit about my monitoring infrastructure, and things to consider more generally when planning monitoring for new or existing infrastructure.

It's never too late to iteratively improve your infrastructure monitoring system - whether it be enrolling that box in the corner that never got added to the system, or implementing a totally new kind of monitoring (e.g. centralised logging) - or, in my case, working on more notifications for when things go wrong.

On a related note, what do your backups look like right now? Are they automated? Do they cover all your important data? Could you restore them quickly and efficiently?

If you've found this interesting, please leave a comment below!

PhD Update 15: Finding what works when

Hey there! Sorry this post is a bit late again: I've been unwell.

In the last post, I revisited my rainfall radar model, and shared how I had switched over to using .tfrecord files to store my data, and the speed boost to training I got from doing that. I also took an initial look at applying contrastive learning to my rainfall radar problem. Finally, I looked a bit into dimensionality reduction algorithms - in short: use UMAP (paper).

Before we continue, here's the traditional list of previous posts:

In addition, there's also an additional intermediate post about my PhD entitled "A snapshot into my PhD: Rainfall radar model debugging", which I posted since the last PhD update blog post. If you're interested in details of the process I go through stumbling around in the dark doing research on my PhD, do give it a read:

https://starbeamrainbowlabs.com/blog/article.php?article=posts/519-phd-snapshot.html

Since last time, I have been noodling around with the rainfall radar dataset and image segmentation models to see if I can find something that works. The results are mixed, but the reasons for this are somewhat subtle, so I'll explain those below.

Social media journal article

Last time I mentioned that I had written up my work with social media sentiment analysis models, but I had yet to finalise it to send to be published. This process is now completed, and it's currently under review! It's unlikely I'll have anything more to share on this for a good 3-6 months, but know that it's a process happening in the background. The journal I've submitted it to is Elsevier's Computers and Geosciences, though of course since it's under review I don't yet know if they will accept it or not.

Image segmentation, sort of

It doesn't feel like I've done much since last time, but looking back I've done a lot, so I'll summarise what I've been doing here. Essentially, the bulk of my work has been into different image segmentation models and strategies to see what works and what doesn't.

The specific difficulty here is that while I'm modelling my task of going from rainfall radar data (plus heightmap) to water depth in 2 dimensions as an image segmentation task, it's not exactly image segmentation: the output is significantly different in nature from the input I'm feeding the model, which significantly increases the difficulty of the learning task I'm attempting to get the model to work on.

As a consequence of this, it is not obvious which model architecture to try first, or which ones will perform well or not, so I've been trying a variety of different approaches to see what works and what doesn't. My rough plan of model architectures to try is as follows:

  1. Split: contrastive pretraining
  2. Mono / autoencoder: encoder [ConvNeXt] → decoder [same as #1]
  3. Different loss functions:
    • Categorical Crossentropy
    • Binary Crossentropy
    • Dice
  4. DeepLabV3+ (almost finished)
  5. Encoder-only [ConvNeXt/ResNet/maybe Swin Transformer] (pending)

Out of all of these approaches, I'm almost done with DeepLabV3+ (#4), and #5 (encoder-only) is the backup plan.

My initial hypothesis was that a more connected image segmentation model such as the pre-existing PSPNet, DeepLabV3+, etc. would not be a suitable choice for this task, since regular image segmentation models such as these place emphasis on the output being proportional to the input. Hence, I theorised that an autoencoder-style model would be the best place to start - especially so since I've fiddled around with an autoencoder before, albeit for a trivial problem.

However, I discovered with approaches #1 and #2 that autoencoder-style models with this task have a tendency to get 'lost amongst the weeds', and ignore the minority class.

To remedy this, I attempted to use a different loss function called Dice, but this did not help the situation (see the intermediary A snapshot into my PhD post for details).

I ended up cutting the contrastive pretraining temporarily (#1), as it added additional complexity to the model that made it difficult to debug. In the future, when the model actually works, I intend to revisit the idea of contrastive pretraining to see if I can boost the performance of the working model at all.

If there's one thing that doing a PhD teaches you, it's to keep on going in the face of failure. I should know: my PhD has been full of failed attempts. A saying I found online (I forget who said it, unfortunately) definitely rings true here: "The difference between the novice and the master is that the master has failed more times than the novice has tried".

In the spirit of this, the next step in proving (or disproving) that this task is possible is to try a pre-existing image segmentation model and see what happens. After some research (again, see the intermediary A snapshot into my PhD post for details), I discovered that DeepLabV3+ is the current state of the art for image segmentation.

After verifying that DeepLabV3+ actually works with its intended dataset, I've now just finished adapting it to take my rainfall radar (plus heightmap) dataset as an input instead. It's currently training as I write this post, so I'll definitely have some results for next time.

The plan from here depends on the performance of DeepLabV3+. Should it work, then I'm going to first post an excited social media post, and then secondly try adding an attention layer to further increase performance (if I have time). CBAM will probably be my choice of attention mechanism here - inspired by this paper.

If DeepLabV3+ doesn't work, then I'm going to go with my backup plan (#5), and quickly try training a classification-style model that takes a given area around a central point, and predicts water / no water for the pixel in the centre. Ideally, I would train this with a large batch size, as this will significantly boost the speed at which the model can make predictions after training. In terms of the image encoder, I'll probably use ConvNeXt and at least one other image encoder for comparison - probably ResNet - just in case there's a bug in the ConvNeXt implementation I have (I'm completely paranoid haha).
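As a sketch of the windowing step this backup plan would need, here's a hypothetical pure-Python patch extractor that grabs the neighbourhood around a central pixel, zero-padding at the edges (the real pipeline would do this with tensor ops over whole batches, but the idea is the same):

```python
def extract_patch(grid, row, col, radius):
    """Return the (2*radius+1) x (2*radius+1) neighbourhood centred on
    (row, col), zero-padding anywhere the window falls off the grid edge.

    The classifier would then predict water / no water for the centre pixel.
    """
    size = 2 * radius + 1
    height, width = len(grid), len(grid[0])
    patch = [[0.0] * size for _ in range(size)]
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            r, c = row + dr, col + dc
            if 0 <= r < height and 0 <= c < width:
                patch[dr + radius][dc + radius] = grid[r][c]
    return patch
```

Sliding this window over every pixel is exactly why a large batch size matters - prediction over a full radar frame means one classification per pixel.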

Ideally though, I want to get a basic grasp on a model that works soon, and leave the noodling around with improving performance until later, as it would be very cool to attend IJCAI 2023 if at all possible. At this point it feels unlikely I'll be able to scrape something together for submitting to the main conference (the deadline for full papers is 18th January 2023, with abstracts by 11th January 2023), but submitting to an IJCAI 2023 workshop is definitely achievable I think - they usually open later on.

Long-term plans

Looking into the future, my formal (and funded) PhD research period is coming to an end this month, so I will be taking on some half time work alongside my PhD - I may publish some details of what this entails at a later time. This does not mean that I will be stopping my PhD, just doing some related (and paid) work on the side as I finish up.

Hopefully, in 6 months time I will have cracked this rainfall radar model and be a good way into writing my thesis.

Conclusion

Although I've ended up doing things a bit back-to-front on this rainfall radar model (doing DeepLabV3+ first would have been a bright idea), I've been trying a selection of different model architectures and image segmentation models with my rainfall radar (plus heightmap) to water depth problem to see which ones work and which ones don't. While I'm still in the process of testing these different approaches, it will not take long for me to finish this process.

Between now and the next post in this series in 2 months' time, I plan to finish trying DeepLabV3+, and then try an encoder-only (image classification) style model should that not work out. I'm also going to pay particularly close attention to the order of my dimensions and how I crop them: I found yesterday that I had mixed up the width and height dimensions, feeding one of the models I've tested data in the form [batch_size, width, height, channels] instead of [batch_size, height, width, channels] as you're supposed to.
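A tiny sanity check like the following sketch would have caught that bug early - `shape` is whatever your data pipeline reports (e.g. a numpy array's or tensor's `.shape`), and the expected spatial dimensions come from the dataset spec:

```python
def check_bhwc(shape, expected_height, expected_width):
    """Verify a 4D shape is [batch_size, height, width, channels] with the
    expected spatial dimensions; catches swapped width/height immediately.

    Only useful when height != width - square crops hide the bug entirely.
    """
    if len(shape) != 4:
        return False
    return shape[1] == expected_height and shape[2] == expected_width
```

Asserting this once, right before data enters the model, is cheap insurance against silently training on transposed inputs.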

If I can possibly manage it I'm going to begin the process of writing up my thesis by writing a paper for IJCAI 2023, because it would be very cool to get the chance to go to a real conference in person for the first time.

Finally, if anyone knows of any good resources on considerations in the design of image segmentation heads for AI models, I'm very interested. Please do leave a comment below.

Art by Mythdael