Big IT for CERN's particle-smashing experiment

Organised by the UK grid computing group GridPP, that lab and 16 others in the country will offer tens of thousands of CPUs to the effort. The tier one set-up at RAL is one of the top ten sites for bulk processing. RAL will supply streaming data and backups, and will help to process the nearly-raw data being pumped out of CERN. The various distributed computers will be connected using private fibre optic links as well as high-speed sections of the public internet.

Tripping the light(path) fantastic

Neil Geddes, a director at RAL, told IT PRO that the data is processed in real time at CERN and streamed, mostly raw, to sites like his own over "not quite standard internet." While certain sites will have a hard-wired network, the 10 terabytes of data a day the lab expects to see will travel over dedicated lightpaths on fibre networks designed for academic use.

"Certain colours of light on optical lines are guaranteed for our use," Geddes explained. This allows the huge amounts of data to flow without interruption to either the lab or to the rest of us using the internet. "It's one of the more novel aspects, but it has been around for a while," he added.

While it may not be truly new tech, CERN's team have still pushed the envelope, setting new "land speed records" in data transfers ahead of the LHC experiment, Geddes said.

Once the data hits places like RAL, it is backed up, reprocessed and sent on to other sites. At RAL, it is shoved onto tape for backup, and then sent out for archiving (and therefore additional backup) to tier two sites around the world, where it receives more detailed processing and where scientists extract the signals they need for their work.

RAL will also be reprocessing the data and "serving it to other physicists around the world." For example, if a physicist in California needs access to RAL's data, it has to be accessible and organised for what Geddes called a "glorified database query."
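To give a flavour of what such a "glorified database query" might look like, here is a minimal, purely illustrative sketch using Python's built-in sqlite3 module. The table, columns, run numbers and file paths are invented for the example; they are not RAL's or GridPP's actual catalogue schema.

# Illustrative only: a toy metadata catalogue lookup, standing in for the kind
# of query a remote physicist might run against a tier one site. The schema,
# run numbers and paths are invented for this sketch.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE datasets (
        name TEXT,      -- e.g. a detector run identifier
        site TEXT,      -- which tier one or tier two site holds a replica
        path TEXT,      -- where the files live at that site
        events INTEGER  -- number of collision events recorded
    )
""")
conn.execute(
    "INSERT INTO datasets VALUES (?, ?, ?, ?)",
    ("run-000123", "RAL", "/archive/ral.ac.uk/run-000123", 1500000),
)

# A physicist in California asks: which sites hold run-000123, and where?
for site, path, events in conn.execute(
    "SELECT site, path, events FROM datasets WHERE name = ?", ("run-000123",)
):
    print(f"{site}: {path} ({events} events)")

The point is not the particular database, but that the raw physics files have to be indexed and organised so a remote user can pull out exactly the slice of data they need.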

Just as the LHC experiment has been ramping up for years, so has RAL's computing set-up. The site uses tape-based backup and has a "few" petabytes of spinning disks, as well as hundreds of racks housing several thousand CPUs. In the past year, RAL has spent some £2 million on tech equipment for the lab, much of it going to support this project.

Because of the nature of the computer equipment RAL requires - big and powerful - and the sheer scale of the project, buying kit is a six-month process. Installation isn't any easier. "You can't just wheel it in and turn it on," he said. Across the twelve major tier one sites, "not one has had an incident-free installation," he noted.

But they're ready now, Geddes hopes. The lab has been running 100 terabytes a day for a long time, he said, and as far as what it receives from CERN goes, not much is expected to change when the beams first collide. What will change is the number of scientists from around the world clamouring to get at the data, he said. "We've not had several thousand physicists trying to access the data for real," Geddes said. "Being faced with many more demanding users, accessing it in an ad hoc way, is our biggest challenge. It's large-scale processing, and on top of that we have to be ready for a large-scale user community. That's the real challenge."

At home

GridPP was also put in charge of the global LHC@home project last year.

Similar in idea to the famous SETI project, it lets people without server farms in their homes take part by downloading a bit of screensaver software to their computers, which processes physics data when the machines aren't otherwise being used. Based at Queen Mary, University of London, it essentially allows some bits and pieces of the analysis to be farmed out.

"Like its larger cousin, SETI@home, LHC@home uses the spare computing power on people's desks," said Dr Alex Owen, who runs the project in the UK, at the time the project moved last year. "But rather than searching for aliens, LHC@home models the progress of sub-atomic particles travelling at nearly the speed of light around Europe's newest particle accelerator, the Large Hadron Collider."

As of last year, 40,000 people had run the LHC@home programme, contributing the equivalent of 3,000 years of computing on a single machine. At the moment, the system is focused on understanding the machine's optics, GridPP's Britton explained.

The future of grid computing

Just as Sir Tim Berners-Lee came up with a world-changing solution when he created the World Wide Web to organise information, some believe the work being done for the LHC project could propel distributed computing into the mainstream. Contrary to what some believe, the grid is not a replacement for the internet; it's a different way of sharing and analysing data, especially large amounts of it, using networks such as the internet.

Grid computing is already used for medical and meteorological modelling, for example, as well as by some financial firms. But the move by enterprises and consumers alike to cloud computing, such as hosted apps, could signal a change in the way people expect to access their information.

And that's where grid computing comes in. The grid is middleware which sits on a network such as the internet and automates machine-to-machine interactions, allowing computers and systems to connect and collaborate in managing data and running applications. The middleware used by the LHC is, unsurprisingly, an open source package called Globus.
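As a hedged illustration of that kind of machine-to-machine automation, the sketch below shells out to globus-url-copy, the GridFTP transfer client that ships with the Globus Toolkit, to replicate a file from one site to another. The hostnames and paths are made up for the example, and a real transfer would also need a valid grid proxy certificate and properly configured storage endpoints.

# A minimal sketch of middleware-style automation: copying one file between
# two GridFTP endpoints by calling Globus Toolkit's globus-url-copy.
# The hostnames and paths are hypothetical; this assumes globus-url-copy is
# installed and a valid grid proxy already exists on the submitting machine.
import subprocess

def replicate(source_url: str, dest_url: str) -> None:
    """Copy a file from one grid storage endpoint to another."""
    subprocess.run(["globus-url-copy", source_url, dest_url], check=True)

if __name__ == "__main__":
    replicate(
        "gsiftp://gridftp.tier1.example.ac.uk/data/run-000123/events.dat",
        "gsiftp://gridftp.tier2.example.edu/replica/run-000123/events.dat",
    )

In a production grid, of course, the middleware layers scheduling, authentication and failure handling on top of transfers like this, rather than leaving them to hand-written scripts.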

And if that can be scaled up to manage the biggest experiment across the most complicated computer link-up the world has yet seen, surely it can handle a few piddly online apps too, assuming we live past this week to try it out.