Optimizing Your Work with High-Core Count Machines
Server processors from Ampere, AMD, and Intel have increased dramatically in core counts in recent years, and the trend is continuing. What can you do with so many cores, and what are the pitfalls?
Computer manufacturers are making multi-core systems the norm for consumers. Even mobile devices like the iPhone tout the number of cores, including those dedicated to machine learning. When it comes to data center hardware, core counts have climbed even more dramatically, due in no small part to the massive workloads (and substantial purchasing power) of public cloud providers like AWS and GCP. Chip makers such as AMD and Intel have responded by releasing parts with ever-higher core counts.
Most people's day-to-day computing experience tops out at 4-core or maybe 8-core systems. But as more users sit down in front of machines with 32, 64, or 96 cores (Ampere's latest Altra parts go up to 128!), there is a bit of a learning curve.
So how can you optimize your work when moving over to these beefy systems? Users need to understand their workflow, create new processes, and manage expectations to get maximum output from these high-core count computers.
Wrap Your Head Around Your Workflow
To benefit from a machine with a large core count, you first have to set expectations and organize your work to better fit that environment.
The first step is to understand your workload well enough to break it into enough pieces to take advantage of all the separate compute units you have. That might come down to restructuring makefiles or build systems so you can experiment with how many jobs run at the same time (and maybe not aiming to build the entire system all at once).
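As a minimal sketch (assuming a make-based project with a clean target; the job counts here are just examples), you can time the same build at different levels of parallelism to see where adding more jobs stops helping:

```bash
# Time the same build at several parallelism levels to find the sweet spot
# for this particular project on this particular machine.
for jobs in 8 16 32 64 "$(nproc)"; do
    make clean > /dev/null
    echo "== make -j${jobs} =="
    time make -j"${jobs}" > /dev/null
done
```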
Machines with large core counts can also run into memory access problems because of NUMA (non-uniform memory access). Not all memory is the same: there is a difference between on-chip and off-chip memory, and each socket is paired with its own bank of memory. Socket 1 talks to Memory Bank 1 as the fast, baseline path. If a memory access has to go from Socket 1 over to Memory Bank 0, it could slow the workflow down by as much as a factor of 10. Some programming languages and runtimes try to take care of this for you, but it should be a priority to make sure your workflow is using a larger machine as efficiently as possible.
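If you want to see the topology you're dealing with, a couple of stock commands (assuming the numactl package is installed) will show which cores and memory belong to which node, and the relative cost of reaching each bank:

```bash
# Show the NUMA nodes, the CPUs and memory attached to each, and the
# node-to-node "distance" matrix (bigger numbers mean slower access).
numactl --hardware

# A quicker summary of the same layout.
lscpu | grep -i numa
```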
Getting Stuck in a Single-Threaded Slowdown
To paraphrase parts of Amdahl’s Law, single-core performance still matters, even in a massively parallel environment.
If you have a process that operates in parallel but has one step that is single-threaded, whether at the beginning, at the end, or at some point along the way, then no matter how fast the parallel portion goes, you are still going to be bottlenecked by that single-threaded performance.
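To put rough numbers on it, Amdahl’s Law caps the overall speedup at 1 / (s + (1 - s) / N), where s is the fraction of the work that is serial and N is the number of cores. If just 5% of a job is single-threaded, 128 cores only get you to about a 17x speedup, and even infinite cores top out at 20x.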
Take, for example, compiling the Linux kernel. You can use a lot of cores to compile all of the source files, but at the end, all of those object files need to be linked together, at the same time, into a single kernel. Kernel compile times do shrink considerably, but more cores only improve the parallel portion, not the entire process.
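As a sketch of what that looks like in practice (assuming a configured kernel source tree):

```bash
# Compile with one job per core: the .c -> .o steps scale across the cores,
# but the final link of vmlinux near the end runs largely on a single core,
# so the wall-clock time never drops below that serial tail.
time make -j"$(nproc)"
```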
How Fast Can This Thing Go?
Running your familiar tools unmodified on a 160-core machine won’t automatically get you more performance than your quad-core desktop. For bigger machines, you need to move to highly parallelized versions of those same tools.
For example, let’s say you need to compress a 160GB file. On a 4-core machine, gzip compression tops out quickly because gzip only uses a single core. But on a 160-core machine, you can use the pigz tool to break the file into blocks, send roughly 1GB of the file to each core to compress, and then merge the results back together at the end. At the expense of using your entire system for that time, you can potentially finish the job up to 100x faster.
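A minimal sketch of that comparison (the file name is a placeholder; -p tells pigz how many threads to use):

```bash
# Single-threaded baseline: gzip uses one core no matter how many you have.
time gzip -c bigfile > bigfile.single.gz

# pigz compresses independent blocks on all cores, then stitches them back
# into a normal gzip-compatible stream.
time pigz -p "$(nproc)" -c bigfile > bigfile.parallel.gz
```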
Another issue is memory bandwidth. On a system with a lot of cores, you might end up with many of them sitting idle because you can’t effectively schedule jobs for them. The whole process might be bottlenecked because each bank of memory in a multi-bank system has only so much bandwidth between the processor and external RAM. You’ll end up spending a lot of time thinking about caching, trying to keep working sets small enough that they fit in the caches on the CPU core.
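One quick way to check whether cores are being fed from the wrong bank (numastat ships alongside numactl) is to watch the per-node allocation counters:

```bash
# numa_hit counts allocations served from the node that requested them;
# climbing numa_miss / other_node counts mean processes are paying the
# remote-access penalty instead of using local memory.
numastat
```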
It’s a lot of work, but definitely worth it in the end. If you can overcome the challenges of fully utilizing a high-core count machine, then the opportunities available to you are virtually endless.
With More Cores Comes More Opportunity
What has gotten remarkably better over the years is the “per-core” performance of machines.
Tasks that naturally break up into independent parts are described as “embarrassingly parallel,” and they are finding new life on these machines. Serving web pages is a great example: you can basically assign a core to each simultaneous user for their particular needs. Easy as that.
Load balancers and web front-end proxies like NGINX really benefit from multi-core systems. Video encoding works well with more cores because you can have a lot of parallel streams going and they aren’t dependent on one another. In fact, any project that doesn’t have a complicated tangle of interdependencies and can divide its workload neatly among separate parts could easily put a 96-core machine to work.
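As a rough sketch of that kind of embarrassingly parallel work (the file names and encoder settings here are placeholders), independent encode jobs can be fanned out across the cores with nothing fancier than xargs:

```bash
# Encode every .mov in the directory as an independent job, one ffmpeg
# process per core; -threads 1 keeps each job on a single core so the
# parallelism comes from the jobs, not from ffmpeg's internal threading.
printf '%s\n' *.mov | xargs -P "$(nproc)" -I{} \
    ffmpeg -i {} -c:v libx264 -threads 1 -preset medium {}.mp4
```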
Also, high-core count systems make incredibly good build machines. If software developers spend a lot of time and resources building the systems they’re eventually going to test somewhere else, then having a really big machine to do the builds on is a huge benefit. The quicker you can turn around a build, test, and check cycle, the better your programmers’ productivity. By compressing that cycle time, things that used to take an hour can be completed in 10 minutes, and developers can keep pace with the rate of change of the software tool base, which can be a huge advantage.
Seeing is Believing
How do you know that these high-core machines are actually working optimally? What tools are best for making them run like a stallion?
On our recent Proximity live stream, we showed some examples of how you can use visualization tools to understand the work happening on these cores.
The tool htop shows the activity of all the cores in the system in a graphical display. At a glance, you can see which parts of the system are busy and which parts of the system are idle. You can also see the Linux scheduler move things from core to core and understand how your computer is managing jobs and core capacity.
When you pin a process with numactl, you can see when all of the jobs are running on only one side of the machine. That sort of insight wouldn’t be obvious without these tools and can help you get more useful life out of your machines.
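For instance, a minimal sketch (./my_job stands in for your own workload):

```bash
# Pin the job to NUMA node 0's cores and node 0's memory, then watch in
# htop as only that part of the machine lights up.
numactl --cpunodebind=0 --membind=0 ./my_job
```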
We also demoed a tool called Glances, another single dashboard that gives a live look into Linux system performance. It offers bar graphs showing which processes are the busiest, how busy each of your disks is, and how busy your network is. With this kind of data visualization, we can better manage our hardware for live streaming.
These tools are really useful on large core count machines because it can be difficult to extract that data one core at a time. An opportunity in having a lot of cores is that you can look at your machine from a bird’s-eye view to run and manage tasks, whereas with a smaller core count it can be harder to see, at a granular level, how busy your hardware really is.
What’s Next?
As Moore’s Law comes to an end and it gets harder to push the performance of a single core without melting it, large-scale parallelization is the path forward for our computing. The large number of knobs that can be twisted to change the performance of the Linux kernel on big systems also suggests an opportunity for AI and machine learning to help with the optimization process. We are watching tools like Concertio’s optimization product, which lets an expert use machine learning to zero in on the combination of changes that will make a difference for their workloads.
The lesson from the past is that as you’re designing systems, it would be wise to prepare for three to five years down the road, when even larger core count systems will be the norm. Hopefully, setting expectations and getting the right tools now will make the jump a lot easier.
As William Gibson (may or may not have) wisely noted, “The future is already here, it’s just not evenly distributed.”