November 21st, 2020 by Maarten Vinkhuyzen
Special thanks to Lieuwe Vinkhuyzen for checking that this very simplified view on building neural nets did not stray too far from reality.
The inhabitants of the Tesla fanboy echo chamber have heard regularly about the Tesla Dojo supercomputer, with almost nobody knowing what it was. It was first mentioned, that I know of, at Tesla Autonomy Day on April 22, 2019. More recently a few comments from Georg Holtz, Tesmanian, and Elon Musk himself have shed some light on this project.
The word “dojo” may not be familiar to everybody. It is a school or training facility for Japanese martial arts.
The dojo Elon Musk has been talking about is also a kind of training school, but for computers. It will become a supercomputer specially designed to train neural networks.
If you know what NN, ASIC, or FPGA stands for, skip the explanations. They are not really part of this article, but are useful basic information for people without an IT background.
Explanation of Neural Nets
As long as there have been working computers, starting with those giant vacuum tube machines as big as a house, there have been programmers trying to make them intelligent, like a human being. Those first versions of “AI” (artificial intelligence) were really primitive. (When going out, ask: Is it raining outside? When the answer is “Yes,” get an umbrella. Otherwise, do not get an umbrella. This was in an AI program in the UK, of course). It consisted mostly of long lists of IF … THEN … ELSE … statements.
When the art of programming advanced, we got rule-based programs with large tables of rules, composed of the answers of topic experts who were questioned for days about what they knew and how they got to conclusions. These were called “knowledge programs,” and some were even usable.
While programmers tried to make a program that behaved like a human, neurologists were researching how the human brain worked. They found that the brain consists of cells (neurons) connected by threads (axons and dendrites) to other neurons. Using those threads, the neurons send signals in an electrical or chemical way to those other cells. These brain tissues became known as biological neural nets.
These biological neural nets became the model used by the most ambitious developers of computer-based artificial intelligence. They tried to copy the working of the human brain in software. It was the start of a decades-long journey of stumbles, roadblocks, failures, and slow but steady progress. The “Artificial Neural Net” (just NN for short in IT and computer sciences) became the most versatile of the artificial intelligence programs.
There is one very big difference between these NN and the more traditionally programmed knowledge programs. Traditional programming uses IF-THEN-ELSE structures and rule tables. The programmer decides what the reaction (output) will be to a given event (input).
The behavior of a NN is not programmed. Just like a biological NN, it is trained by experience. A NN program without the training is good for nothing. It extracts the characteristics of “right” and “wrong” examples from the thousands or millions of samples it is fed during training. All those characteristics are assigned a weight for their importance.
When a trained NN is fed a new event, it breaks it down into recognizable characteristics, and based on the weights of those characteristics, it decides how to react to the event. It is often nearly impossible to trace why an event resulted in a specific reaction. Predicting what the reaction will be to an event is even harder.
An empty NN, a blank slate, is not AI. A trained NN can become AI. Where a knowledge program reacts in a predictable way to a programmed event, a well-trained NN reacts in an original way to an unknown event. That reaction should be within the parameters of what we consider a “good” reaction. This creates a complete new set of challenges in testing a trained NN. Has it become AI, and is it smart enough to delegate some tasks to it?
Explanation of ASIC
When most people think about a computer or their telephone, they are vaguely aware that there is a piece inside that makes it tick. This piece is known as “the chip.” For the more technologically advanced, it is the CPU, which stands for central processing unit.
This is a modern technological marvel. It can compute everything it is asked to compute. But like a decathlon athlete or a Swiss army knife, it is not the best at anything. Early on, specialized helper chips were developed — small chips that could do one thing extremely well and very fast. They were the keyboard controller, numeric co-processor for doing sums, graphic chip for painting the screen, and chips for many more functions — like sound, encryption, input-output, network, wireless signals, etc. Together, they are known as application specific integrated circuits, or ASIC for short. They can do their tasks better and faster than the CPU, and free up the CPU to do all the other tasks that are not delegated to an ASIC.
What makes these dedicated chips faster than the CPU speed monster is that the software the CPU executes is replaced by hardware that can only execute the instructions it is designed for. A set of instructions (aka an algorithm) can be up to a thousand times faster when it has its own hardware.
In the Full Self-Driving (FSD) chip designed by Tesla that is the heart of Autopilot HW3.0, there are about half a dozen instruction sets that are executed billions of times. These are replaced by dedicated circuits that make the Tesla FSD chip faster than any chip not designed to run the Tesla Neural Network.
At the Tesla datacenter, the neural network is trained on a large supercomputer, far too big to have in a car or even a large semi truck. It occupies a building. For training the neural network, there are other sets of instructions that must be executed on this supercomputer trillions of times. Doing those on dedicated circuits can speed up the execution of those instructions by a few orders of magnitude, as Elon likes to say.
Explanation of FPGA
Making chips is expensive — so expensive that companies like AMD and Nvidia do not make their own chips anymore. That is outsourced to specialized foundries. If there is a bug in the code that is hardwired onto your chip, after you have the chip baked at a foundry, you might have turned a few hundred million dollars into paperweights. Not the best use of your money.
To make sure the designed chips work as intended, you need to test them before you make them. That is like tasting the pudding before you make it. It is not easy.
There is a special kind of chip, called a “field programmable gate array” (FPGA). It is an impressive name, and I have no idea what it means or how they work. I just know roughly what they can do.
These FPGA can be configured to another hardware layout after they are baked. They can be made to behave like the algorithm is hardcoded in the chip. FPGA are used when you need the added speed of an ASIC but a real ASIC is too expensive or takes too long to make. A FPGA is not as fast as a dedicated/baked ASIC, but it is still a lot faster than running software. These are mostly used for small series in highly specialized machinery, for research and development, and for prototyping.
With the use of FPGA, you can make a “proof of concept” of the chip and computer you are designing and debug the code you intend to hardwire into it. This significantly lowers the chance on making million-dollar paperweights.
Elon Musk recently said that its Dojo supercomputer is now 0.01% ready and should be operational in a year. That comment was more than confusing. Being at 0.01% and being ready at 100% in just over a year? That did not add up. When you are at 0.01% after two or three years of work, are you going to do the other 99.99% in less than a year?
New information revealed that the 0.01% comment was about the working prototype used to validate the design of the Dojo supercomputer. The Dojo prototype was working on FPGA (Field Programmable Gate Array) chips.
The FPGA prototype computer is described by Elon as only 0.01% of the size of the intended Dojo computer. I think the 0.01% is more a figure of speech than a real measure of the size. It is only a very tiny computer compared with what the Dojo will be a year from now.
In the R&D department tasked with development of the Tesla FSD (aka Autopilot) system, there are not only ~200 software Jedi masters working on the Autopilot software, but also more than a hundred hardware engineers tasked with building the Dojo supercomputer. (See the CleanTechnica exclusive “Tesla Autopilot Innovation Comes From Team Of ~300 Jedi Engineers — Interview With Elon Musk.”)
The challenges for the Dojo supercomputer are the heat produced, the amounts of data that have to be moved from the storage systems to the computer’s internal memory, and the speed of execution of the NN training software. The execution should not be paused to wait for the delivery of new data from the storage system. To conquer the heat and data transport problems, they just need a lot of money to implement the best solutions on the market today. This article is about the speed-of-software-execution problem.
An algorithm coded in the C programming language versus the same algorithm hardcoded using transistors in the chip have vastly different speeds of execution. The hardcoded algorithm can be a hundred to a thousand times as fast. This does not mean that the Dojo computer can train a neural network (NN) in a day whereas a computer the same size using optimized C code takes perhaps 3 years for the training. All the other limitations still apply, and a lot of code will still be in software running on normal hardware. How much the training is accelerated will hopefully be revealed by Tesla when the Dojo is taken into production.
Tesla Autopilot is a NN that is trained in a large data center using the huge pile of data Tesla has collected. All the Tesla cars on the road with FSD software onboard (active or running in shadow mode) register traffic situations. The situations that can be used for training the NN are anonymized and uploaded to the Tesla datacenter. Once the NN has learned how to drive, it is downloaded to the cars using over-the-air (OTA) update technology.
For a long time, I have tried to understand what a NN is and what training a NN means. It is not a program like the ones programmers typically write. Instead, its actions and reactions are not programmed into it. It evaluates input data using rules and reference examples it has created itself during its training.
I think at first it is a huge, empty program to execute rules without rules or reference data in it. All the placeholders for the rules and data have yet to be filled. After the training, it becomes a program that can perform tasks like a human within the confines of its intended function.
Professionals in the AI field call all versions a NN. For clarity, I use the term neural network for the “blank sheet” state, for the already very complex software before it is being trained. I use AI for the trained NN that is able to perform its intended functions after training. Those trained NN that are not able to do what is expected are just failed attempts, good for teaching the trainers what does not work, where they have to improve the NN software or the training data sets.
Perhaps the best way to visualize it for us, NN noobs, is to think of it as a big empty spreadsheet with many tabs, but nothing defined yet. There are large formula libraries, many datatypes we can use, and a strong macro language.
Training the NN is analogous to using a specialized program and a huge data repository to fill the spreadsheet. This program is used for extracting information from the data, aggregating, correlating, finding common aspects, looking for cause and effect, and then storing these aspects in the cells. Next, this program is defining the relationships between the cells with formulae, adding rules for interpretation of the results and to generate reports and graphs based on parameters you can enter.
What is in this spreadsheet-based program is not the data or even the aggregation of the data that is used in the training. It is not is a huge repository of all those examples that are used in the training. That is building a standard data warehouse and using normal reporting technology.
What the training does is turn the data into rules and descriptions. Some rules are more important than other rules, and some descriptions are preferred over others. No human programmer has written those rules or descriptions or calculated their importance. It is the same kind of training that turns a human baby into a capable adult.
Depending on the way the spreadsheet is filled and configured, it can be a general ledger system, an inventory system, a stock trading or marketing system, or perhaps a brilliant tool to run a political campaign or play StarCraft. It depends on the examples of good and wrong data that are used to train it. What the system will do depends on what data are used to train it.
An example of training the NN to become a functioning AI: The goal is to discover new molecules that could be used as medicine. First select an empty NN of the desired size and complexity. Then collect thousands of chemical formulae that have been tested — in this example, the formulae of 100,000 molecules. Half of them have positive effects and are labelled “good,” the other half is labelled “bad.”
Use a random 90% of the examples to train the NN. Then feed it the other 10% with the instruction to determine the correct label. When the NN attaches the same label as was discovered during the previous testing for most of the test set, you have working AI. Otherwise, you might need more data, a bigger or smaller NN, or perhaps a differently constructed NN. Rinse and repeat.
For testing, a different dataset is used than for training. The NN does not contain a compressed dataset of its training material, indexed and organized in a way that it can quickly look up the label associated with a substance. That would be data warehousing, or another kind of database querying. AI can apply the learned rules onto new situations. That is why you use test data that was not used in the training. What is described here is the simplest testing method. For large and complex systems, there are much more complex and demanding testing methods.
In your car, the AI runs on a different computer than the Moloch that was used to train the network. The difference is one of scale: the Tesla HW3.0 FSD computer that runs the AI fits behind the dashboard. It processes the input from the sensors in real time and decides on the appropriate action faster than a human can.
The Dojo supercomputer with all its supporting network and storage requires a datacenter in a building. The system that trains the NN can supply not gigabytes or terabytes but petabytes, or even exabytes of data, to the NN software and execute the training algorithms. There is no room for this amount of data or this kind of processing power in the FSD computer behind the car’s dashboard. Only the rules distilled from it by the training computer are part of the trained NN, the Autopilot AI.
When a human programmer alters a large software system, the goal is to alter as little code as possible and to not change the working of the rest of the code. The testing is based on knowing exactly what code is changed and what code is not changed. To verify that the modified code and all the old code still work as intended, programmers use unit testing, regression testing, and a suite of other methods to assure that the modification did not alter the functioning of the system outside the intended change.
The structure of the rules and relations of a trained NN are unknown. Therefore, a programmer cannot alter them. The only usual way to alter the NN is wiping it clean, extending the training dataset with examples of the new functionality, training the NN with the new dataset, starting from zero. This cycle is repeated for every update, every correction of the AI. The new data can influence all the rule making during the training, far outside the functions it is intended for. Think of it as the ripple effect of a stone thrown in the water. Because there is often a completely new AI after each update cycle, all of its functionality has to be tested.
This is the big difference between programming by a human and training a NN with a computer. You can not go in and just alter the faulty line of code — at least large parts of the system are rebuilt. Testing the change is equally more complex.
The famous StarCraft AI, which could beat 99.8% of human players, was trained in three days. But building the dataset and designing the NN took three years. During these 3 years there were many training and testing cycles before the result of the final training session was good enough. The FSD AI is much more complex. It is trained with a lot more data. It has to be developed, using Elon’s favorite expression, orders of magnitude faster than that of StarCraft AI. Otherwise, it would be next decade, if not next century, before FSD and robotaxis became reality.
Originally, 2D still frames were the sensor input for the FSD AI. When the usability was not good enough of those frames, when they reached a maximum in what they could achieve with such data, Tesla switched to better input. The next was probably stitching multiple frames to one panorama view. After the stitching came adding a software generated cloud of lidar dot data to the frames, creating 3D images. After each improvement, another local maximum was reached in what could be achieved with the data. Passing such a local maximum required better input and a more powerful NN. By adding time we now have 4D video data as input to the AI.
In between the reaching of local maxima, there were many iterations of labelling and expanding training data sets and improving the NN software. It was a cycle of improve, train, test, with ever increasing datasets fed into the training algorithms of a more and more complex software system. Testing evolved from driving in a single highway lane to driving from origin to destination over multiple highways and through cities.
Dojo will likely not only be a trainer, but also a platform to drive millions of miles on simulated test routes. Simulations are not good enough for training, but complex situations derived from real-world data can be excellent for initial testing before real-world testing of Alpha and Beta releases.
The first early Beta version of the FSD software is being released to a select group of customers for testing. Call it version 0.92.n.nn of the FSD system.
This FSD system is nearly functionally complete, but all the functions are in need of lots of improvements. FSD is not a monolithic system. It is composed of many parts that perform different functions, parts that collaborate and communicate. Many parts are neural networks in their own right.
There appears to be a contradiction between now having the Beta out for testing and hopefully having a working system within a year, versus needing the Dojo computer for development, which will become available in a year at its earliest.
The system that is complete and working in a year will be a system that still needs supervision. It will be good, even amazing. It will not be perfect. Think of it as version 0.97.n.nn. For further improvements, the law of diminishing returns will require ever bigger efforts for ever smaller increases in reliability.
A good driver follows the rules and is predictable. There are many differences in traffic regulations: Driving on the right side or left side of the road. Should you keep your lane or keep to the right. For example, overtaking on the right can cost you your driver’s license in the Netherlands. It is not just a simple traffic violation like parking or speeding. There are different habits — in the Netherlands you adjust your speed after passing the speed sign, in Germany you do it before you reach the sign.
Tesla is not finished with FSD development when a car can drive from Los Angeles to NYC or from Seattle, Washington, to Tampa, Florida. That is not even the level of an average driver. The FSD AI has to become better than 99.999% of drivers in all situations. After that, it has to learn to drive like that in about 200 jurisdictions with all (slightly) different rules, regulations, and customs.
There is still an awful lot of training to be done in Tesla’s FSD future. The tailor-made CPU for the special Dojo computer is needed to reach the required speed. Current hardware is just not fast enough to create all the FSD AI systems in time.
Have a tip for CleanTechnica, want to advertise, or want to suggest a guest for our CleanTech Talk podcast? Contact us here.
Latest Cleantech Talk Episode