26.8.2021

Photo: Petr Bravenec

Petr Bravenec
Twitter: @BravenecPetr
+420 777 566 384
petr.bravenec@hobrasoft.cz

I first encountered an understandable description of artificial intelligence at the beginning of the 1990s, probably in the book "Paths of Modern Programming" by Peter Koubsky. The book got lost in the course of the geological epochs, so I can't say with absolute certainty that at the end of the book, on page 290, there was a schematic sketch of a neural network, the backpropagation algorithm explained, and everything distorted by the author's own idea of neural networks.

At the time, my attempts with neural networks ended in complete incomprehension and debacle. But that doesn't mean I wasn't interested in the subject. As soon as enough information and examples became available, I came back to artificial intelligence.

Nowadays, there is a lot of literature, courses and examples available, and there are many ways to learn how to work with neural networks (Introduction to Machine Learning). Neural networks can be simple, and learning texts and examples are easy to understand. Personally, I think that understanding and practically mastering artificial intelligence may not be difficult, but it does require studying a large amount of material, trying out a lot of examples, spending a huge amount of time on unsuccessful experiments, and not giving up. In fact, a number of obstacles await the interested candidate.

So let's talk about what makes neural networks a rather difficult topic.

What is artificial intelligence

Simple question, complex answer. The term "artificial intelligence" sells well, so marketing departments use it far too often, in places where no artificial intelligence was actually used. For example, an antispam filter is a classic task. The mail-sorting algorithms that come closest to artificial intelligence are based on Bayesian statistics. Still, I would not call such a statistical system intelligent.

For simplicity, I would define artificial intelligence in today's sense as applications that use the backpropagation algorithm to program (learn) themselves. All modern networks for image recognition, people detection in images, facial recognition, autonomous driving and so on fall into this category.

The backpropagation algorithm (Andrej Karpathy: CS231n Winter 2016: Lecture 4: Backpropagation, Neural Networks 1) searches for a solution using successive approximations. Such applications are not programmed, they are trained.
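Backpropagation itself is beyond the scope of this text, but the principle of successive approximations can be shown on a toy example with a single parameter. This is only an illustrative sketch in plain Python, not real backpropagation:

w = 0.0                  # initial guess for the single "weight"
learning_rate = 0.01
x, target = 3.0, 12.0    # a training "dataset" of one sample; the ideal w is 4

for step in range(200):
    prediction = w * x
    error = prediction - target
    gradient = 2 * error * x       # derivative of (w*x - target)^2 with respect to w
    w -= learning_rate * gradient  # small step against the gradient

print(w)                 # converges to roughly 4.0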

Terminology

If you are beginning to study neural networks, your daily tool, at least in the beginning, will be a small machine learning glossary. Without it, you'll be staring at any text like a goose at a bottle, with no idea what the text says. So let's see: training, tensor, shape, backpropagation, gradient, normalization, z-score, regularization, L1, L2, GPU, CPU, TPU, dropout, overfitting, fully-connected layer, loss, mean squared error, metric, recall, precision, F1, Euclidean distance, Manhattan (New York) distance, cosine similarity, cross-entropy, softmax, embedding, centroid, t-SNE, KNN, Adam, LSTM, GRU, RNN, ReLU, augmentation, convolution, inference...

There is usually nothing complicated about these terms; relatively simple things are hidden behind them. But there are a lot of these concepts, and a beginner can easily get lost in them.

If you have been programming for a while before coming to the field of AI, you'll inevitably find the terminology confusing. Some terms are used differently in other fields. Embedding, for example, has absolutely nothing to do with deploying a neural network on an embedded computer - same word, two completely different meanings.

Mathematics

It is often argued that it is vital to know the mathematics behind neural networks. That is both true and not true. In my opinion, artificial intelligence today is really more of a mathematical field with a lot of overlap into programming. Anyone who wants to do AI seriously today has no chance without an orientation in the relevant mathematical fields.

The mathematics behind neural networks forms a great wall between the average programmer and neural networks. Ordinary developers have no need to think mathematically (they think more algorithmically), they don't know the terminology used, and they usually do not understand the statistical principles used in the field of artificial intelligence. The knowledge barrier is very high.

On the other hand, you don't need mathematics to download a demo example from github and customize it, for example, to distinguish between a closed and an open door. But if you are an experienced AI programmer, it is much easier and more hardware-efficient to design your own simple convolutional network. That's where knowledge of mathematics is very useful.

We created our first working, commercially successful AI-based product without a deep understanding of the mathematics behind neural networks. And the style of work matched that. The FANN library on which we built the application is simple and easy to use, very well adapted to the thinking of a programmer with no experience with neural networks. The developer doesn't have to worry about what's going on inside, and the library doesn't require any deep knowledge of mathematics. To evaluate the neural network's performance, we therefore had to reinvent the wheel and rediscover well-known statistical methods (specificity, sensitivity, accuracy...).
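For illustration, these are the statistical methods we had to rediscover; a minimal sketch that computes them from the four cells of a binary confusion matrix (the counts are made up):

tp, tn, fp, fn = 90, 50, 10, 5   # true/false positives and negatives

sensitivity = tp / (tp + fn)                   # how many real positives were caught
specificity = tn / (tn + fp)                   # how many real negatives were rejected
accuracy    = (tp + tn) / (tp + tn + fp + fn)

print(sensitivity, specificity, accuracy)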

You can't get by without math if you're using a library like TensorFlow (or others). The TensorFlow documentation is full of mathematical formulas and expressions that can make your life very complicated if you don't know them.

Mathematics can be very useful when deploying a network. Some network architectures need to have elements cut out that are only needed for training. This need may arise for performance reasons, or because the target platform does not support some operations at all. If the unsupported operation is dropout, the situation is still bearable and easily manageable. If some mathematical operation is not supported, you have a problem. Do not expect success if you don't know that "a square can be replaced by multiplication". And you can substitute any mathematical expression commonly used in AI for the quoted phrase in the previous sentence.
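As a minimal illustration of what "a square can be replaced by multiplication" means in practice, here is a sketch in Keras; the model is made up and serves only to show that an unsupported tf.square can be expressed as an elementwise multiply:

import tensorflow as tf

inputs = tf.keras.Input(shape=(8,))

# y = tf.square(inputs)                                      # may be unsupported on the target
outputs = tf.keras.layers.Lambda(lambda t: t * t)(inputs)    # the same result, plain multiplication

model = tf.keras.Model(inputs=inputs, outputs=outputs)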

The mathematics required is not complex, but it includes a number of disciplines. It is very useful to have at least a basic overview (in order of importance):

  • linear regression,
  • statistics,
  • probability,
  • convolution,
  • matrices,
  • Fourier decomposition,
  • derivatives and integrals.

You won't encounter linear regression anywhere in neural networks. However, a neural network can be explained very well using regression. And regression is what the neural network really does: it simulates the desired function, and it uses some form of regression to find it. So when you read about regression in a text about neural networks, it's useful to understand the concept. After all, linear regression is useful in other areas of human activity and allows you to understand and solve other IT problems.
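A minimal sketch of what that regression looks like, using nothing but NumPy; the network does conceptually the same job, just on a much larger scale:

import numpy as np

# Fit y = a*x + b to noisy data by ordinary least squares.
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + np.random.normal(scale=0.5, size=x.shape)

a, b = np.polyfit(x, y, deg=1)   # slope and intercept of the best-fit line
print(a, b)                      # roughly 2.0 and 1.0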

From linear regression, we move into its broader mathematical context - statistics. Any text on artificial intelligence, unless it was written in a company's marketing department, will contain a lot of statistical concepts. After all, a neural network is one big statistical behemoth. You will never get a definitive yes/no answer from a neural network. And if you do, it's still going to be sometimes yes and sometimes no. The neural network actually only estimates the results! You will need statistics to be able to deploy a neural network into production.

Probability is closely related to statistics. A neural network never says: "It's a banana," but rather: "With 81% probability, it's a banana." Probability and statistics are often just two different views of the same problem.
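Those probabilities typically come out of a softmax over the raw network outputs (logits). A minimal NumPy sketch with made-up values:

import numpy as np

logits = np.array([2.0, 0.4, 0.1])   # raw scores for banana, apple, orange
probabilities = np.exp(logits) / np.exp(logits).sum()

for label, p in zip(["banana", "apple", "orange"], probabilities):
    print(label, round(float(p), 2))   # roughly 0.74, 0.15, 0.11 for these values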

For image recognition, convolutional networks are used. Again - the internet is full of ready-made, trained examples, and nothing stops you from downloading YOLO or SSD and using it to recognize people, cars and so on. The other side of convolution is Fourier decomposition. It is good to know about it.
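What a convolutional layer actually computes can be sketched with a hand-made kernel sliding over a toy image; this is the same operation for which the network learns its own kernels:

import numpy as np
from scipy.signal import convolve2d

image = np.zeros((6, 6))
image[:, 3:] = 1.0                   # left half dark, right half bright

kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])      # responds to vertical edges

edges = convolve2d(image, kernel, mode="valid")
print(edges)                         # non-zero only around the edge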

The vast majority of data in neural networks exists in the form of matrices. I know practically nothing about matrices, and that's fine. The data = matrix approach is natural and very intuitive for developers.

While you will rarely encounter integrals when working with neural networks, the concept of a derivative is quite common. In mathematics, derivatives are used to find the extrema of a function. And that's exactly what a neural network does. You'll never have to differentiate any function by hand when working with a neural network. But it's good to know how the derivative is defined, how it's used in neural networks and what a differentiable function should look like.
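You never differentiate by hand because the framework does it for you. A minimal TensorFlow sketch of automatic differentiation:

import tensorflow as tf

x = tf.Variable(3.0)

with tf.GradientTape() as tape:
    y = x ** 2                    # y = x^2

dy_dx = tape.gradient(y, x)
print(dy_dx.numpy())              # 6.0, i.e. the derivative 2*x at x = 3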

Derivatives are related to linear regression. Linear regression is a great and illustrative tool to describe the work of a neural network. But while you only need to know a few simple formulas to use linear regression, as the dimensions and nonlinearities increase, the problem becomes more complicated. And then it is very useful to know something about derivatives (yes, even in statistics you sometimes need to differentiate a function).

Examples

In the AI world, a large number of frameworks are available as open source. Similarly, there are a large number of examples available on github that you can just download, install, run...

No, it's not that simple. Most of the AI software available on github doesn't work. However, this is not the fault of the applications, but a consequence of the rapid development of the entire field of AI. There are a number of problems associated with this:

  • different frameworks (TensorFlow, PyTorch...),
  • different versions of frameworks (TensorFlow 1, TensorFlow 2),
  • different versions of Python (2.7, 3.6, 3.7, 3.8, 3.9, 3.10...),
  • various versions of other packages...

Often, you can easily ignore other frameworks. But in specific situations you are forced to use a different framework. For example, if you try to use an Nvidia Jetson computer to detect objects in a video stream, you'll probably have no choice but to use TensorFlow. The time spent with other frameworks may be wasted, depending on the combination of HW and SW. Neural networks are not as versatile and hardware-independent as other software.

TensorFlow exists in two different versions, and they differ greatly in their approach to writing neural networks. If you want to learn TensorFlow 2, you'll quickly find that a ready-made example only exists for TensorFlow 1. And even in TF1 it doesn't work very well, because it was written for a version of TF1 that is too old.

The classic task used in AI is image recognition. There are a number of different architectures for this task, and ready-made, trained networks can be downloaded from the internet. The two TensorFlow versions differ quite substantially: TF1 uses the mathematical notation of the neural network much more often, while TF2 uses Keras. TF2 allows the entire network to be written in a way that is closer to the thinking of the average developer.
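A minimal sketch of the TF2/Keras style - a small classifier written as a plain stack of layers (the sizes are the usual MNIST-like example values, not anything from a real project):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),   # ten output classes
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()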

Python is a complication, too. For those interested in AI, Python is a tool they'd rather not have to deal with, but the differences between 2.7 and 3.x are significant (if one tries to run an example downloaded from the internet). The differences between 3.x versions are not as significant, but they can be quite annoying if the example uses features from a version newer than the one installed on your computer.

Software often needs a number of additional helper libraries. This is where all hell breaks loose. Python has its own PIP packaging system, where individual packages depend on other packages, often on specific versions, so sometimes installing a "ready-made" example from the internet takes a huge amount of time, and to run such a package you have to create your own environment with carefully chosen versions of the libraries you need. On x86 this is still manageable, but on the ARM platform (Jetson) packages are usually not precompiled and must be compiled on the target platform. Unfortunately, PIP is a nutcase and compiles in a single thread only. As a result, installing simple code from Github can take all day.

Docker can help a lot with creating and maintaining an operational environment for AI. So to get a simple example from the web up and running, you'd still have to become an expert in virtualization...

Data availability

Today's leaders in artificial intelligence are big companies like Google. I don't mention Google by accident - the TensorFlow framework I'm learning to work with was created by Google. All the software that Google has is also available to others. So why is Google the leader, why not someone else?

The answer is: data. Big IT companies have huge amounts of data that are the most important part of corporate know-how. Facebook can train facial recognition on literally billions of photos. Not only does Facebook have those photos, people will even tag the photos themselves.

Nvidia can recognize car license plates in its DeepStream technology. The neural network has learned from tagged photos to find the license plate and read the text on it. Nvidia says the network was trained on three hundred thousand images of license plates from the United States. That's a volume of data that an individual or a small business has no way to prepare.

And even if you already have the data, working with it can be math-intensive. For neural networks, data needs to be normalized, and normalization procedures vary depending on the statistical distribution of the data.
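The z-score from the glossary above is one such normalization; a minimal NumPy sketch, suitable for roughly normally distributed data (skewed data usually needs a different treatment):

import numpy as np

data = np.array([12.0, 15.0, 11.0, 90.0, 14.0, 13.0])

z_scores = (data - data.mean()) / data.std()   # zero mean, unit variance
print(z_scores)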

Network training

Training the network is a joyful process of watching a complete dunce become an intelligent entity. I mean... sometimes. It must be acknowledged that neural networks usually learn very readily, but you see the process only as a few curves on a graph in TensorBoard. That doesn't tell you anything about what the network can actually do.

At this point, the military AI experts who tried to teach a computer to recognize aerial images with tanks are a great example. Thousands of images were presented to the neural network and all the training metrics produced excellent readings. But in reality, the network achieved very poor results. How is that possible? The images with the tanks were taken in nice weather. The neural network had learned to distinguish good weather from bad weather. The tanks didn't interest it.

In the same way, anyone can be fooled by a neural network.

Training is a time-consuming process. As the case above shows, an enormous amount of time can be completely wasted.

However, sometimes it is very difficult to force a neural network to learn at all. I'm currently experimenting with Attention OCR. As long as I train on a set of up to 33 characters, I'm fine. If I add one extra character to the training set, Attention OCR stops learning, and instead of minimizing the error it does the exact opposite - the error grows exponentially and the network gets dumber and dumber with each step. If your neural network starts behaving like this on a Friday afternoon when you were planning to "run the training over the weekend", you can lose a huge amount of time.

It also takes a lot of time to find such a bug. One cycle can take ten to fifteen minutes, and you won't know about the error until it finishes. If you can't find the error, it means a lot of work over many days, and the whole situation is very frustrating.

Necessary hardware

The main computational tool for training neural networks is the graphics card. There are some tasks where the graphics card is not appropriate and it is better to train on the CPU, but such tasks are definitely in the minority.

AI tasks today can do magical things: they can read, talk, they can tell a grandma from a beet, they can tirelessly control a production line... but it comes at a cost. A simple network in our autonomous vehicle contains over 6 million variables and performs billions of mathematical operations (addition and multiplication) to process a single image. The input dataset contains less than ten thousand images, and a single training cycle can take several days. The amount of computation, even for small tasks, is simply enormous and requires the most powerful hardware possible.

In graphics card calculations, the amount of data transferred to the graphics card is a limiting factor. When training, every bit of image data must be transferred to the GPU. Then the calculations are performed and everything repeats with the next image. You can speed things up a bit by transferring several images to the GPU at once and training the neural network in batches. With a larger number of images, things go faster. But batch training requires more memory on the graphics card. On the other hand, it is often possible to train on multiple GPUs at the same time.
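In Keras the batch size is a single parameter; a sketch only, assuming a model and training data already exist under these made-up names:

model.fit(train_images, train_labels,
          batch_size=64,    # bigger batches = fewer GPU transfers, but more GPU memory
          epochs=10)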

Along with the amount of memory in the GPU, the memory requirements of the computer also increase. I very quickly found that 32 GB of RAM is not enough to work with a 16 GB GPU. I doubled the memory after only a week.

The resulting effect is quite unpleasant - for neural networks we need a very powerful graphics card with as much memory as possible. Better still, several such cards. At the same time, we need an adequately powerful computer. Forget about being able to train a neural network on your usual desktop or laptop. When training is running, the screen practically freezes. Training neural networks requires a dedicated computer. However, a look at the current prices is frustrating.

Indeed, cryptocurrency mining has the same hardware requirements, and GPUs have become very expensive and hard to get. Only the graphics card manufacturers are winning.

My unattainable ideal is a card like the MI100. But the price of such a card is completely insane and could drive one to such desperate acts as applying for subsidies.

Computing power isn't the only delaying factor. Neural networks work with massive amounts of data, especially in conjunction with image processing. You can very quickly get into a situation where the volume of a single project runs into tens or hundreds of gigabytes.

Time consumption

The amount of time you spend training networks is enormous. See the table with the time consumption of the thispersondoesnotexist.com site. With eight Nvidia Tesla V100 graphics cards, network training took roughly 10 days. Try to estimate how long the training would take on one card... your card.

The amount of time is very frustrating when you are trying to debug a neural network.

Everything takes a long time when working with neural networks. Python is an interpreted language and reports an error only at the moment it encounters it. When this feature of Python is combined with the speed of neural networks, even trivial errors are detected after a few minutes, at best. Such errors are normally found in seconds by a compiler - my primary language is C++.

When the training finally runs successfully, you get feedback only after a few hours. Until then, you have very little indication that the training is going well.

Python

Personally, I consider Python to be a rather obscure language. For outsiders, Python has one special thing that I haven't seen in any other environment, or at least not to this extent: when programming in Python, you should use the "Right Python Way". Python is just different, and it is proud of it.

For those interested in artificial intelligence, Python has one more tricky feature. Python was originally designed for teaching programming. Because of this, Python has taken hold in education and academia. Because the whole field of artificial intelligence moves very slowly from research to practice, quite a lot of the examples and libraries come from academia. For many of the examples you can find links to research papers on arxiv.org.

Academics love mathematics. Mathematics is a great tool to describe the world around us. Mathematics uses symbolic notation, but that is its essential flaw - without knowing the context, mathematical notation is not readable. Can you see that the following equation describes the arithmetic mean?

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} a_i

This symbolism often appears in code, where context is even harder to obtain, and the examples are then completely unreadable:

x = s / n


versus

arithmetic_mean = values_sum / values_count


Academic Python code therefore often contains all the letters of the Latin alphabet and transliterations of the Greek alphabet, but descriptive names like iterator_over_images, number_of_images or input_shape are much rarer.
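The same computation written both ways; the first version is perfectly correct, but without context you would hardly guess it is an arithmetic mean:

# academic style
def f(X, n):
    s = 0
    for i in range(n):
        s += X[i]
    return s / n

# descriptive style
def arithmetic_mean(values):
    values_sum = sum(values)
    values_count = len(values)
    return values_sum / values_count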

Inference

Inference is the process in which you ask the trained neural network to make a prediction. At this point the network is deployed in the real world and has to perform the task it was trained for. Surprisingly, this step is one of the most challenging.

First of all, the difficulty of inference is due to all the tutorials and examples on the Internet that show how easy it is to work with neural networks:

  • install TensorFlow,
  • write this program,
  • download the demo dataset,
  • train,
  • check out the TensorBoard,
  • profit!

Most tutorials are missing the most important thing that makes AI useful. They all show you how easy it is to work with AI, but only a few of them show you how to use the trained neural network.

You will find neurons connected with synaptic lines in any beginner's text, but nobody shows you how to perform inference with the trained neural network. Often you get only the instruction "take a tensor shaped (1,3,320,320) and do the inference". This is not very illustrative. Rarely do the examples mention a REST API or communication with a database.
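For the record, this is roughly what "take a tensor shaped (1,3,320,320) and do the inference" translates to in code; the model path, the preprocessing and the channel-first shape are assumptions for illustration only:

import numpy as np
import tensorflow as tf
from PIL import Image

model = tf.keras.models.load_model("trained_model")         # hypothetical path

image = Image.open("input.jpg").resize((320, 320))
pixels = np.asarray(image, dtype=np.float32) / 255.0         # (320, 320, 3)
pixels = np.transpose(pixels, (2, 0, 1))[np.newaxis, ...]    # (1, 3, 320, 320)

prediction = model.predict(pixels)
print(prediction)     # raw output; what it means depends on the network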

Successfully deploying a trained network is challenging for another reason, too: the scope of knowledge required to deploy the network is very different from the knowledge you need to train the AI. We usually want to deploy AI in some complex, organic environment:

  • an error is detected (visually bad product),
  • the error is highlighted on the screen with a description and awaits an operator action,
  • after interaction with the operator, the serial number is detected (reading a QR code or barcode),
  • the error is entered into a database somewhere in the cloud,
  • the production line is given the command to scrap the product.

In addition, the AI programming environment differs from the UI environment. While AI tends to be mostly in Python, our UI programs are often written in PHP (for the web) or C++ and QML (for various embedded applications).

Real world deployment

A neural network can watch the office and store every person visiting the office in a database (I know, GDPR). Or it can control the barrier in your company car park and only let in cars with known number plates. Of course, there's no problem running such a network on a regular PC with a slightly more powerful graphics card. But try suggesting that to a customer. If you want to sell your solution, you need to offer it as cheaply as possible.

You can certainly run TensorFlow Lite on a Raspberry Pi, and for simple networks the performance will be fine. The CPU power will be enough if you connect a camera to the Raspberry Pi and try to build a system that visually detects whether a window is open or closed. You can check the picture only once a minute, or only on demand, and the CPU will be powerful enough.
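A sketch of such a check with the tflite_runtime interpreter; the model file name and the meaning of the output are assumptions for illustration:

import numpy as np
from tflite_runtime.interpreter import Interpreter

interpreter = Interpreter(model_path="window.tflite")   # hypothetical model
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

frame = np.zeros(input_details[0]["shape"], dtype=np.float32)  # the camera frame goes here
interpreter.set_tensor(input_details[0]["index"], frame)
interpreter.invoke()

result = interpreter.get_tensor(output_details[0]["index"])
print(result)    # e.g. the probability that the window is open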

But when you want to process Full HD video, the Raspberry Pi becomes a huge drag.

Fortunately, there are various AI accelerators. I can name at least two:

The Edge TPU can be bought as a dedicated USB device, but usually the AI accelerator is built into a computer similar to the Raspberry Pi, with an ARM processor (Coral, Asus Tinker Edge T).

The two accelerators have completely different characteristics. The Edge TPU only works with eight-bit integer resolution. A neural network for the Edge TPU must be converted to TensorFlow Lite and adapted to the eight-bit environment.

We already use the Nvidia Jetson Nano in some projects.

The Jetson can also handle floating-point numbers with 32 or 16 bit precision, but you won't avoid the conversion anyway - if you want to use the AI accelerator to its full potential, you have to convert the network from TensorFlow to TensorRT for the particular machine you want to run the network on.

The conversion is often a huge problem. I have very little experience with TensorFlow Lite, and that experience is rather good, but with TensorRT I've had many unpleasant experiences. TensorRT is designed for inference only. The functions in the neural network that are associated with training are unnecessary and not supported in TensorRT! One such typical function is dropout, a regularization method that randomly turns off some neurons during training. This function is completely unnecessary during inference. Therefore, the neural network must be thoroughly reorganized after training. Without knowledge of how to use matrix operations, the network can be very challenging to modify.
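For a simple Sequential model the reorganization can be as small as rebuilding the model without its Dropout layers; this is a sketch only, real graphs are usually messier and need tools like graphsurgeon:

import tensorflow as tf

def strip_dropout(trained_model):
    # keep every layer except Dropout; the layers keep their trained weights
    kept_layers = [layer for layer in trained_model.layers
                   if not isinstance(layer, tf.keras.layers.Dropout)]
    return tf.keras.Sequential(kept_layers)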

Sometimes some of the basic functions are not supported - for example the split function (splitting a tensor in two). Such functions have to be written in C++ or C and linked to the TensorRT engine externally.

The TensorRT diagnostics are absolutely insane. A frequent error message is just "Assert243", with no explanation. Only a few days of searching will show you that 243 is a line number in a source file somewhere in the DeepStream examples. But you won't be able to find it, and eventually you will stop looking, because on the developer forum you will learn that TensorRT is a proprietary technology and the source code is not available.

For ready-made networks downloaded from the internet, the situation is much more complicated. For example, ready-made, trained image recognition networks typically have a tensor with dynamic dimensions. TensorRT can't handle this, so the entire graph has to be rebuilt and such layers have to be replaced with fixed-dimension tensors. Nvidia offers the graphsurgeon tool, but the documentation is very sparse.

Nvidia offers DeepStream technology for video processing. I really like the idea and I have written about it before. A number of examples can be found in the installation package. The examples should cover the DeepStream technology, but they are written in such a cryptic way that a developer without detailed knowledge of GStreamer will spend several days just trying to create his own simple stream. Only then can he start working on the AI.

The Nvidia Jetson is a nifty computer with a low price, but it's still a marginal product with a small user base. For every thousand people with Raspberry Pi experience, there will be maybe one or two with Nvidia Jetson experience. The Edge TPU is in a very similar position. When developing AI software for those platforms, you'll always be on your own, and nobody will be able to help you with your problems.

Conclusion

Artificial intelligence and deep learning are an interesting area of computing. However, working with artificial intelligence is different from other computing work.

As developers, if you want to embrace artificial intelligence and use its capabilities in your projects, you will have to master a huge amount of new knowledge.

The idea that neural networks are simple is wrong. Today a programmer doesn't normally deal with binary trees, hash tables and other ways of organizing data in memory, but knowledge of the concepts helps a lot. In the AI area, the programmer doesn't have to deal with the implementation details of neural networks either, but knowledge of the concepts helps a lot, too.

With neural network knowledge, the situation is similar: the volume of skills needed is not so different, but the content is completely different and there is only a little overlap with the knowledge of an average programmer.

Working in IT means constant learning. AI is no exception.
