
SCinet and Puppet: Creating an ephemeral HPC network

[Photo: The SCinet DevOps team in action]

“Hey, look at this!” my SCinet colleague said, pointing nonchalantly at his screen, where I could see a 600-gigabit Link Aggregation Group (LAG). He had just finished bonding a series of WAN connections that would be used to facilitate high-bandwidth, low-latency connections with the wider world of research networks during SC16.

Less than 15 minutes later, as I walked past the team handling the show’s wireless networking, I saw the network diagrams for the wireless service that more than 12,000 attendees will expect to use. (Admit it, you have at least two wireless-capable devices within arm's reach right now.) All of this infrastructure is set up in a frantic two-week rush, used for one week, and then torn down within 24 hours. Yes, it's surreal, but in a good way.

By way of quick introduction, my background is in high performance computing (HPC). HPC is computing for research and simulation; computing with a STEM twist. In traditional IT, the goal is service availability. Whether checking e-mail, consuming media or starting up a game, we expect things to just work, and quickly. If an app takes too long, we soon grow impatient and frustrated. HPC, on the other hand, is concerned with throughput. Engineers, researchers and scientists combine data and code into HPC workloads, put them in a queue and wait for the system to calculate the results. Those responsible for administering an HPC system are therefore charged with having it churn through the workload as efficiently as possible. Throughput is key to HPC.

So what is SC16, and why should I care?

It may surprise you to learn that in recent years, one out of every five dollars spent on servers annually goes to systems used for technical computing. (See this article on IDC's tracking of HPC, and this article on the growth of worldwide server sales.) Because HPC has historically been so different from traditional IT, both in infrastructure and operation, organizations are developing specific software and hardware ecosystems to support it. And HPC users and administrators have needed to develop specialized skills to keep workload throughput high, and the systems running at maximum efficiency.

That’s where the SC conference comes in. Since 1988, this conference — the largest related to supercomputing — has served as a place where people can discuss all things HPC. Of course, it needs a network. And that’s where SCinet comes in:

Volunteers from educational institutions, high performance computing centers, network equipment vendors, U.S. national laboratories, research institutions, and research networks and telecommunication carriers work together to design and deliver the SCinet infrastructure. Industry vendors and carriers provide much of the equipment and services needed to build the local and wide area networks. Planning begins more than a year in advance of each SC Conference, and preparation activities culminate with a staging event in late October and a setup event just seven days before the conference begins. ~ From Meet the teams on the conference website

The cats herding the cats

After learning that SCinet has been using Puppet for years, and initiating a discussion with some of the SCinet leadership, I was invited to participate in this year’s production of the conference network.

It’s easy to see why configuration management and a DevOps approach can help when you're standing up hundreds of systems and switches, each with a unique role to play — only to tear them down a couple weeks later.

SCinet is a global collaboration of high-performance networking experts who provide the fastest and most powerful volunteer-built network in the world for the SC Conference. Over 200 volunteers participate remotely and in person for this gig. They come from all over the world (18 countries by my count), from universities, research institutions and (like me) industry. I’m really impressed by the caliber of people and experience that I’m working with. There’s a breadth of experience, diverse viewpoints and enthusiasm for this work that’s both challenging and invigorating. How else would 56 miles of fiber get laid down to support all the researchers presenting their work, vendors demonstrating solutions or attendees using apps? Some of the equipment is volunteered as well: various vendors share their latest and greatest products to showcase SCinet’s capabilities.

The SCinet horde is divided into teams by discipline. For instance, I’m on the DevOps team. There are also teams that handle routing, network security, WAN connectivity, fiber, power, edge networking and so on. Each team is relatively flat, with a small group of experienced team leads and then several volunteers, including students. Some members (particularly the team leads) are veterans who know the ropes and the process, having been through several of these events. Others (like myself and most of the students) are here for the first time, and are learning as we go.

The early days of my participation (April – late October 2016)

Early this year (after the SCinet team emerged from post-SC15 hibernation), I joined the video calls and was given access to the team's Puppet code. They had been steadily expanding their Puppet code base over a few years, and were using several modules they had written themselves, plus a few from the Puppet Forge. The team was using Puppet 3 in a masterless configuration.

My first project was to go through the code and shine it up. Specifically, I made sure the code was easy to read: linting it, refactoring deprecated Puppet code and removing tab characters. (See the Learn more section below for information on deprecated Puppet code.) We also discussed which DevOps philosophies we were going to focus on this year. The team members were all very approachable and easy to work with; as a new member, I needed a little hand-holding and sharing of institutional knowledge. I found the team wiki to be very thorough, but like all thorough wikis, some sections occasionally fall out of date. There is a fair amount of excitement about the documentation-as-code aspect of Puppet’s philosophy, and about the way it decreases the cost of maintaining such a large body of external documentation.
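To make that concrete, here's a hypothetical before-and-after of the kind of cleanup involved; the class and resource names are illustrative, not from the actual SCinet code base.

```puppet
# Illustrative only -- not the actual SCinet code.

# Before: Puppet 3-era style that puppet-lint flags: an unquoted
# resource title, inconsistent quoting and tab indentation.
#
# class ntp {
#   package { ntp: ensure => installed }
#   file { "/etc/ntp.conf":
#     content => template("ntp/ntp.conf.erb"),
#   }
# }

# After: linted -- quoted titles, single quotes where no interpolation
# is needed, two-space indentation, aligned arrows, explicit ordering.
class ntp {
  package { 'ntp':
    ensure => installed,
  }

  file { '/etc/ntp.conf':
    ensure  => file,
    content => template('ntp/ntp.conf.erb'),
    require => Package['ntp'],
  }
}
```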

Another DevOps practice the SCinet team follows is the use of automation to reduce repetitive tasks. For instance, each year the domains used by the show and SCinet change (sc15.org becomes sc16.org), and updating the fully qualified domain names (FQDNs) of systems that have been powered off for 11 months has many ramifications. Not only does the base networking need updating; DNS, DHCP, logging and other core services need updating as well. Additionally, tasks like generating and issuing new certificates and re-establishing other trust relationships are ripe opportunities for automation.
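One way to automate that yearly rollover is to drive everything derived from the show domain off a single parameter. This is a hedged sketch under assumed names; the classes, file contents and resolver address are placeholders, not the real SCinet modules.

```puppet
# Hypothetical sketch: centralize the per-year show domain so that
# changing one value updates everything derived from it.
class scinet::settings (
  $show_domain = 'sc16.org',  # next year, change only this
) {
}

class scinet::resolver {
  include scinet::settings

  file { '/etc/resolv.conf':
    ensure  => file,
    # 10.0.0.53 is a placeholder resolver address.
    content => "search ${scinet::settings::show_domain}\nnameserver 10.0.0.53\n",
  }
}
```

With this pattern, DNS search paths, templated configs and FQDN-bearing resources all follow a single edit (or a single Hiera value), rather than a hunt through every module.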

The first stage of work, from April until staging in October, focused primarily on understanding the existing code base and planning the changes I would suggest to make it easier to read, maintain and manage. There were also potential improvements to be made in migrating from the masterless Puppet model the team had been using to a master-based (agent/master) model. Some modules would benefit from refactoring, too. Workflow improvements would be appropriate as well, and would mitigate the risks that come with an increasingly broad group of contributors. Finally, I hoped the SCinet DevOps team’s work could be useful to some of the other volunteer teams that help make SCinet a reality.
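For readers unfamiliar with the distinction, the two models differ in where catalogs are compiled. Roughly (paths and the hostname below are placeholders):

```shell
# Masterless: code is distributed to each node (rsync, git pull, etc.)
# and the catalog is compiled and applied locally.
puppet apply --modulepath=/etc/puppet/modules /etc/puppet/manifests/site.pp

# Master-based: each node runs an agent that requests its compiled
# catalog from a central Puppet master over SSL.
puppet agent --test --server=puppet.scinet.example
```

A master centralizes code distribution, certificate signing and reporting, which matters when many contributors are pushing changes to many short-lived systems.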

Paul Anderson is a senior professional services engineer at Puppet.

Learn more

  • For more about deprecated Puppet code, see the deprecated language sections of the Puppet 3.8 reference manual and the Puppet 4.8 reference manual
  • About SC16: SC16, sponsored by ACM (Association for Computing Machinery) and IEEE Computer Society, offers a complete technical education program and exhibition to showcase the many ways high performance computing, networking, storage and analysis lead to advances in scientific discovery, research, education and commerce. This premier international conference includes a globally attended technical program, workshops, tutorials, a world class exhibit area, demonstrations and opportunities for hands-on learning. For more information on SC16, please visit http://www.sc16.supercomputing.org/, or contact communications@info.supercomputing.org.