Try 1.5: Python + numpy

Published on May 7, 2024, 6:45 p.m.
Last changed on June 6, 2024, 8:08 p.m.

Ok so… when someone encounters the words "python" and "number crunching" in the same sentence, very often a single word will be conjured up in that person's mind:

numpy

And that's it. It would be foolish to argue or believe that numpy (and the innumerable members of its ecosystem) is not one of the best-known and most universally used tools in the python world when one has big data crunching in mind. SciPy is another magnificent monster, though some would argue that it is richer in features, albeit slower.

Anyway, the very same word was conjured up in my mind so long story short: numpy it was.

All was suddenly faster

The code was quickly and easily "upgraded" to using numpy: np.array([]) everywhere!
The scripts/management commands were run and all was good. For an identical amount of generated data, runs were anywhere between 3 and 5 times faster.
Great, now let's add more features to the data.
Although… I should have been wary of such a "small" performance gain. More on this below.

Adding more features

With the time-saving assurance in my heart and the wind at my back, I eagerly started strolling down the path of more depth; in other words, I started working on deeper levels of detail, most notably the satellites' physical characteristics.
Parallel to this, a first version of the 3D client was starting to take shape as well. Naturally, this led me to take immersion and "navigation" more and more into account; for this reason, I had to start working on breadth; in other words, on a wider and richer view of the nearby universe when navigating it.

One can see where this was going:

See more things
Each thing with more details

Now, even with various lazy/on-demand loading techniques in place, the numpy code itself started to slow to a crawl. The main culprit was random number generation: the code involved far, far too many round trips between two worlds:

python
numpy (==C)

The gains delivered by numpy, while far from negligible, were eaten alive by the python<>numpy back-and-forth while generating content.
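To make these round trips concrete, here is a minimal sketch. It is not the project's actual code: the function names and the radius range are made up for illustration. Both functions return the same number of values, but the first crosses the python/numpy boundary once per value, while the second crosses it once in total.

```python
import numpy as np


def radii_one_by_one(seed: int, count: int) -> list[float]:
    """Slow pattern: one python -> numpy call per generated value."""
    rng = np.random.default_rng(seed)
    # Each iteration pays the full call overhead for a single float.
    return [float(rng.uniform(0.1, 10.0)) for _ in range(count)]


def radii_batched(seed: int, count: int) -> np.ndarray:
    """Fast pattern: a single call that fills the whole array inside numpy."""
    rng = np.random.default_rng(seed)
    return rng.uniform(0.1, 10.0, size=count)
```

With object-by-object generation, the code naturally ends up in the first shape, and the per-call overhead is paid again for every value.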

Code architecture choices

I could probably have completely changed the way data was generated, and played with parallel numpy arrays, i.e. ~~use numpy properly~~ leverage the huge performance of numpy and vectorise more.
But I did not want to do that; on the contrary, I wanted to keep the "classical" code architecture that, believe it or not, made the code so easy to understand, debug and extend: good old classes, objects, methods and properties are, like it or not, very very convenient. :-)
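For illustration only, this is roughly what the "parallel arrays" route could look like. The attribute names, the ranges and the density formula are assumptions, not taken from the project: index i across all arrays describes one satellite, and derived values are computed for every satellite in a single expression.

```python
import numpy as np


def generate_satellites(seed: int, count: int) -> dict[str, np.ndarray]:
    """Hypothetical generator: one array per attribute, one row per satellite."""
    rng = np.random.default_rng(seed)
    radius = rng.uniform(0.1, 10.0, size=count)  # arbitrary units
    mass = rng.uniform(1.0, 100.0, size=count)   # arbitrary units
    # Vectorised: the volume and density of all satellites at once.
    volume = 4.0 / 3.0 * np.pi * radius ** 3
    density = mass / volume
    return {"radius": radius, "mass": mass, "density": density}
```

The catch is visible right away: a satellite is no longer an object with methods and properties, just a position shared by several arrays.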

Numpy is absolutely great to read, mutate and compare huge series of data. My problem was: I did not have these series, I needed to generate them:

In a reproducible way: seeding the numpy random number generator several times at each "iteration" was incredibly expensive
In a unique way at each iteration
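These two requirements can be sketched as follows, assuming (hypothetically) that every object builds its own generator from a universe-wide seed plus its own index. The class, its attributes and the seeding scheme are illustrative, not the project's actual code.

```python
import numpy as np


class Satellite:
    """Hypothetical object whose attributes depend only on (universe_seed, index)."""

    def __init__(self, universe_seed: int, index: int) -> None:
        # A fresh generator per object: reproducible and unique per index,
        # but this seeding step is repeated for every single object.
        rng = np.random.default_rng([universe_seed, index])
        self.radius = float(rng.uniform(0.1, 10.0))
        self.mass = float(rng.uniform(1.0, 100.0))


# The same seed and index always give the same satellite...
a = Satellite(42, 3)
b = Satellite(42, 3)
# ...while another index gives a different one.
c = Satellite(42, 4)
```

This keeps the classical object-oriented shape, at the price of the repeated seeding described above.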

So the numpy phase was rather short-lived. And I had to come up with something else.