WinHEC 2008: Offload media for fun and profit.
By Simon Bisson & Mary Branscombe in Editorial
Posted in operating systems, Processors, Windows, Microsoft on
Windows 7’s library-aware Media Player is only one small part of a big change in the way Windows handles media. Elements Microsoft hinted at last week in PDC sessions and at the Windows 7 reviewers workshop are coming into focus here at WinHEC.
One interesting snippet from this morning’s keynote was the fact that Windows 7 would be able to offload media codecs to hardware. While the keynote referred to it as a way of transcoding media streams for delivery to network media players and other devices, it turns out to be part of a whole new way of handling media in Windows – one that more than just Media Player will be able to use.
The key is what Microsoft is calling Windows Media Foundation, a low level layer that links to device drivers and hardware. It’s this new layer that handles dynamically switching media streams from device to device when you plug in new hardware (and when you unplug it again – great for using your Bluetooth stereo headset when you want a little privacy in the office), and it’s also the layer that makes sure Windows sound schemes aren’t routed to communications devices and applications – so no more IM bings and bongs when you’re talking to a colleague on a Bluetooth headset using Skype.
One important function for the Windows Media Foundation is handling hardware codecs. The latest generation of graphics hardware contains support for H.264, along with AAC and other audio formats. In Windows 7 hardware will have priority over software – so if your graphics card or motherboard will do the work for you, your CPU won’t need to take the strain. It’ll even work with USB offload processors. The real trick comes in when you’re transcoding existing media for streaming to a remote player. If the player supports DLNA 1.5 profiles and reports the media formats it supports, Windows 7 will use the Windows Media Foundation to handle converting your media to the appropriate format while it streams.
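The hardware-first rule is easy to picture in code. What follows is a minimal sketch, not the Windows Media Foundation API: `Codec`, `pick_target_format` and `pick_codec` are hypothetical names, the codec and format lists are invented for illustration, and falling back to the player's first reported format is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Codec:
    """A hypothetical codec entry; not a real Media Foundation type."""
    name: str
    fmt: str          # media format it produces, e.g. "h264"
    hardware: bool    # True if it runs on a GPU, chipset or USB offload device


def pick_target_format(source_fmt: str, player_formats: list[str]) -> str:
    """Keep the source format if the player reports it, else use its first reported format."""
    if source_fmt in player_formats:
        return source_fmt
    return player_formats[0]


def pick_codec(target_fmt: str, installed: list[Codec]) -> Optional[Codec]:
    """Prefer a hardware codec for the target format; fall back to software."""
    candidates = [c for c in installed if c.fmt == target_fmt]
    # The sort is stable: hardware entries move to the front, ties keep install order.
    candidates.sort(key=lambda c: not c.hardware)
    return candidates[0] if candidates else None


# Invented example data: two H.264 codecs (one hardware) and a software MPEG-2 codec.
installed = [
    Codec("software-h264", "h264", hardware=False),
    Codec("gpu-h264", "h264", hardware=True),
    Codec("software-mpeg2", "mpeg2", hardware=False),
]

# The remote player reports H.264 and MPEG-2; the source file is WMV.
target = pick_target_format("wmv", ["h264", "mpeg2"])
chosen = pick_codec(target, installed)
print(target, chosen.name if chosen else None)
```

The point of the sketch is the ordering: software codecs stay in the list as a fallback, but anything that can run on dedicated hardware wins.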
You can transcode in software, but it adds latency – so if your hardware supports it, Windows will divert your streams to the hardware, and just deliver the result to the client device. It’s a sensible response to a tricky problem, and one that also means you can handle all aspects of a conversion in the hardware, without needing a CPU at all…
Specialised hardware will always have an edge over the general purpose CPU, so it’s important for operating systems to take advantage of it. Microsoft isn’t alone in doing this – Apple will be doing much the same with Snow Leopard (and is using a QuickTime plug-in to take advantage of NVIDIA’s hardware H.264 support on the latest MacBooks). This is a trend that’s going to end up everywhere, from desktop PCs to servers, to phones – and it’s one that’s going to save you time, power and embarrassing pauses. What’s not to like?
–Simon
Under the MacBook hood with NVIDIA
By Simon Bisson & Mary Branscombe in Editorial
Posted in Processors, Silicon, Hardware, Laptop, Apple on
Apple’s switch from basing its laptops on Intel chipsets to NVIDIA’s new 9400M series has raised more than a few eyebrows. There’s a good reason for that switch, as I discovered when I had a conversation with NVIDIA’s Rene Haas last week.
In the past mobile graphics chips have been a poor cousin to their desktop relations. Some may have the same product numbers, but a fraction of the power. With the advent of technologies like OpenGL and the rise of General Purpose GPU computing (GPGPU), laptop GPUs looked like they were being left far behind. Popular software is starting to take advantage of GPU computing, with companies like Adobe using GPU programming to accelerate and smooth operations in the latest version of its CS imaging and design suite. You couldn’t get the smooth rotations and zooms in Photoshop CS4 without OpenGL - and if your chipset doesn’t support it, you’ll just get an error message.
Apple’s new machines aren’t just using the 9400M for OpenGL. There’s a lot more to the chips than GPUs (though the 16 GPU cores take up most of the silicon). The chips also include much of the core system hardware you usually find as separate chips. The result brings the Northbridge and Southbridge into the same package, using much less real estate and allowing motherboards to be less than half the size, and at the same time giving increased graphics performance for the same power footprint. Laptops get better gaming performance, and applications get better user interface effects.
The MacBook’s improved video performance has been noticed, and it’s down to the 9400M’s built-in HD video support. There’s hardware support for the H.264 HD video codec Apple uses for its iTunes movies, as well as support for many of the decryption techniques needed to work with DVDs and BluRay. While Apple may not support BluRay yet, Windows will with Vista’s SP2 release, and NVIDIA’s chips handle the AES encryption used on BluRay discs, as well as handling high-end features like BD-Live.
The MacBook Pro shows off another of NVIDIA’s features, Hybrid SLI, which lets hardware developers add a second GPU for more processing power when it’s needed - turning it off when its additional boost is unnecessary. The Pro has an additional 9600M GT which can be used for gaming or intensive image processing - using more power than when a single GPU is used for word processing or web browsing.
So why is NVIDIA producing this new chip? The main reason is the size of the laptop market. New laptops will outsell desktops by a large margin by 2012, and users want the same performance in their bags as on their desks. Only a small proportion of notebooks have discrete GPUs, with most using integrated graphics. GPUs need to compete with integrated chipsets on price, form factor and performance, so this is where a new single chip solution comes into play.
There’s an interesting caveat to this story, too. NVIDIA’s CUDA GPGPU framework has become an interesting tool for developers who want to work with massively parallel application programming on GPUs. In the past NVIDIA has been resistant to talking about other GPGPU frameworks - but the Apple relationship is changing that. Apple has announced that it will be supporting the OpenCL GPGPU APIs in the Snow Leopard release of OS X, and as a result, NVIDIA will be supporting OpenCL access to its CUDA frameworks. Supercomputer performance in a laptop will be a very interesting side effect of the 9400M chips.
This isn’t an exclusive deal with Apple, either. More laptop manufacturers will be switching to this approach - so we can look forward to a much better laptop experience with Windows and Linux in the future.
–Simon
CPU vs GPU, mythbusted or mythdirected?
By Simon Bisson & Mary Branscombe in Editorial
Posted in visualisation, Processors, Silicon on
The folk from Mythbusters were on hand at NVision08 to show the audience the difference between CPU and GPU computing. In true Mythbusters fashion they did it with vast amounts of paint, and what must have been one of the world’s largest paintball guns.
First they began with a simple (for them) demonstration of serial operations - using a paintball gun wielding robot to draw a smiley face on a whiteboard. A hundred or so blue dots made the robot one of the slowest (and loudest) dot matrix printers we’ve seen.
Parallel operations would take something a little larger, and their 1,100-paintball inkjet printer filled much of the stage. Powered up it would create a picture of the Mona Lisa in glorious 8-bit colour in a fraction of a second. Huge air tanks held the compressed air the device needed to simultaneously launch all the paintballs in all the tubes.
The demonstration was certainly impressive, but it was more than a little misleading.
The type of data-centric work that CUDA GPUs handle is more about using parallel processes to handle lots of small pieces of data, not about building complex images from small pieces of data. With a parallel architecture like that you develop algorithms that break down big problems and big data sets into smaller, easier to work with, pieces. Farmed out across tens and hundreds of processors in a GPU, each data block can be processed, before being reassembled and the results delivered.
These aren’t new techniques, either: the approach is at the heart of computational fluid dynamics and finite element analysis. The parallel techniques used in GPU computing are certainly impressive, and are already delivering supercomputing to the desktops of the scientists and engineers who need the power (an NVision session on using GPU-based supercomputers to model the plasma dynamics around neutron stars and the black hole at the centre of the galaxy was particularly impressive). Low-cost high-performance computing is the GPU’s strength, especially when compared to the hefty power requirements of an equivalent array of traditional CPUs.
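The break-down, farm-out and reassemble pattern can be sketched in a few lines of Python. This is only an illustration of the shape of the algorithm, under some loud assumptions: `split`, `process` and `parallel_sum_of_squares` are made-up names, the sum of squares stands in for real per-block work, and a thread pool stands in for the GPU's processors (so don't expect a GPU-style speed-up from it).

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_chunks):
    """Break a big data set into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional element each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def process(chunk):
    """The per-block work; a sum of squares stands in for the real calculation."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=8):
    """Split the data, process each block on its own worker, then reassemble."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process, chunks))
    return sum(partials)  # reassembly step: combine the partial results


data = list(range(1_000))
result = parallel_sum_of_squares(data)
print(result)
```

Each block is independent of the others, which is what lets hundreds of processors work on the problem at once; only the final reassembly step has to wait for everyone to finish.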
The Mythbusters’ demonstration was good (and an enjoyable piece of theatre), but it really told a different story. So how could the intrepid special effects team have told the real story of GPU computing?
How about one robot carrying a large, heavy cube across the stage? Suddenly it’s joined and overtaken by a swarm of smaller machines, all carrying smaller cubes - cubes that together weigh as much as the single cube on the struggling robot. Or if paint is the preferred metaphor, a can of paint slowly emptying through a single pipe. Meanwhile another can empties through hundreds of holes in much less time.
So, how would you demonstrate it?
–S
Let’s get physical
By Simon Bisson & Mary Branscombe in Editorial
Posted in Processors, Silicon, Software on
Nvidia has decided that the visual computing world needs a conference, and has taken over San Jose to deliver just that. It’s an odd event, with a high-level academic parallel processing track running alongside highly analytical business sessions - and what’s billed as the world’s largest LAN party filling one of the conference halls.
Games may have made Nvidia, but it’s the rest of the graphics industry that keeps it going. Simulation and CAD drive much of today’s industrial design, while complex financial calculations can be run on GPU-powered parallel processors. It’s not just black hole plasma dynamics - it’s also the models that help calculate how a fusion reactor will operate. According to Nvidia GPU computing is bringing supercomputing to the desks of the people who need it the most - for just the cost of a video card.
One of the keynotes showcased a NASCAR simulator used by drivers to hone their skills. On stage we heard a populist story of what it was like to be a driver, and what it was like to use simulation tools. Off stage we heard a more interesting story about how the simulator developers were looking at using the latest generation of GPUs in their application. The ability to use a GPU for parallel processing - and the availability of powerful hardware physics engines - has made them completely rethink their next generation, as the new hardware features mean that they can now work on making the simulation more realistic.
That’s what the drivers want. Asked what he really wanted from a simulator, Kyle Busch didn’t talk about new high-resolution graphics or realtime ray tracing. What he wanted was more accurate physical behaviours. In the real world passing on the left is different from passing on the right, while slipstreaming another car can change the performance dramatically. A simulation may look real, but without the physics it’s not realistic at all.
One plan for the next generation is to move away from the current car model, with only six degrees of freedom. Instead, it really needs 72 degrees, for all the hinge and flex points - all of which are changing dynamically. That’s where parallel processing comes in, as it allows a car to be modelled in real time, taking advantage of physics engines to turn those model calculations into real world behaviours. Improving the simulation will mean more (and happier) customers - as well as a continually improving model that can be shared with vehicle manufacturers.
It’s an approach that requires specialist processing that goes beyond the traditional CPU. Don’t confuse it with the death of the CPU, though. There will always be a place for the traditional CPU - it’s just that silicon technology has become ubiquitous enough for specialist hardware to offload processor intensive functions.
Need to encrypt something? Just use the hardware cryptosystem built into a TPM. Need to do thread intensive Java? Hook up an Azul network processing appliance. Need to do complex vector calculations on large amounts of data? Use a GPU. Nvidia’s CEO Jen-Hsun Huang talks about it as heterogeneous computing, where the CPU handles general tasks, and more specialised hardware handles the complex tasks that tax general purpose silicon.
Intel and AMD may still say that general purpose processors are just what the world needs - but they’re still investing in HyperTransport and QuickPath, the fast buses that specialised silicon needs. I wonder why they’re doing that, if specialised silicon is the dead end they say it is. Is there something about Moore’s Law they’re not telling us?
3G laptops: cheaper, faster, longer-lasting?
By Simon Bisson & Mary Branscombe in Editorial
Posted in Laptop, Hardware, Processors, Intel, Networking, Internet, Wireless, Mobile on
I wouldn’t be surprised to open a packet of cornflakes and have a 3G USB dongle fall out, they’re getting so common. They may be convenient but they’re not the most efficient way to get a 3G connection on a laptop. A notebook with a built-in antenna gets 25% better bandwidth (because the better the signal, the more data throughput you get). And given that most 3G cells have only a 1Mbps pipe connecting them to the Internet, you need all the throughput you can get.
The rumblings about EU regulation of SMS and mobile data costs carry on in the background along with OFCOM’s proposals for a voluntary code of conduct for ISPs to make sure your DSL line gives you the speed you’ve paid for, and OFCOM has also been making noises about checking out what speeds mobile broadband really offers. It’s a nice idea and it might concentrate the attention of the operators on the issue, but the speed you get depends on a mix of your handset, the Internet backhaul of the base station, how many other people are using data on the same base station - and the weather, so it’s hard to be precise.
I was impressed by the independent tests that Vodafone was trumpeting last month claiming they have the fastest HSDPA network. They’re claiming up to ten seconds faster to download a 2MB MP3 file (13.54 seconds) and four times faster to open a Web page (6.7 seconds). Anecdotally, Vodafone does feel faster than T-Mobile and Orange in the areas of London we visit, on EDGE and on HSDPA. With BT’s announcement today that it’s dropping backhaul pricing, if the mobile operators put in connections from the base stations to the Internet that are as fast as your connection from your phone to the base station, we’ll start to see which side of the network really needs to speed up.
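For a rough sense of what those download figures mean as throughput, here is a back-of-the-envelope calculation. It assumes 1MB is 8 megabits, ignores protocol overhead, and reads “up to ten seconds faster” as a slowest rival taking 23.54 seconds; `throughput_mbps` is just an illustrative helper.

```python
def throughput_mbps(size_megabytes, seconds):
    """Effective throughput in megabits per second, taking 1 MB as 8 megabits."""
    return size_megabytes * 8 / seconds


# Figures from the tests quoted above: a 2MB MP3 file in 13.54 seconds.
vodafone = throughput_mbps(2, 13.54)

# Assumption: "up to ten seconds faster" means the slowest rival took 23.54 seconds.
slowest_rival = throughput_mbps(2, 13.54 + 10)

print(f"{vodafone:.2f} Mbps vs {slowest_rival:.2f} Mbps")
```

On those assumptions the quoted result works out at a little over 1Mbps of real throughput, against well under 1Mbps for the slowest network in the test.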
I expect battery life is also going to be better when you’re using built-in 3G than when you’re going through a USB port. The voltage won’t be much different but you can have much more sophisticated power management - and of course if you have a better signal, you don’t have to keep turning the radio up to try and improve things.
So Lenovo’s Centrino 2 announcements caught my eye today. Either the growth in the dongle market means Ericsson has dropped the prices of its 3G modules (scale, competition or a mix of the two) or Lenovo has decided that 3G is the best way to fight off the buzz around ultra-cheap machines like the Eee PC and Aspire One that cut features along with the price. Whichever it is, Lenovo is dropping the price premium for built-in 3G from around £100 to around nothing: from August 4th notebooks with a mobile broadband module will cost, and I quote, “approximately the same price as those without”.
Although BT is now referring to the still-in-draft 802.11n proposal as a standard and putting it in the shiny new BT Home Hub (the rotating ten foot model of it at the BT event last night was a little scary), the n debacle drags on. At this rate, we might have HSDPA built into more laptops than 802.11n…
-Mary
Beyond the valley of the CPU
By Simon Bisson & Mary Branscombe in Editorial
Posted in Processors, Software, Applications, Server, Mobile on
(or “The return of the co-processor”)
The white heat of technology in the 1980s was focussed on the BBC Micro. Not only was it one of the heftiest 8-bit machines around, its open bus made it possible to add more processing power. With everything from music machines to Z-80s running CP/M, the BBC Micro could share its keyboard with many different CPUs.
Those days are on their way back.
Last week Toshiba announced a new range of consumer notebook PCs. Like many of Toshiba’s systems they’re designed to be media players, and in a side swipe at BluRay, they now come with an upscaling DVD drive. That’s where the coprocessor magic comes in, as Toshiba is using a derivative of the same Cell processor in Sony’s PS3 to drive its imaging software. A quad core version of the Cell sits alongside a dual core Intel processor, and it’s used to handle a range of processor intensive tasks - acting as a feed to the GPU that drives the screen. Not only does it upscale DVD streams (very impressively), it can also be used to handle file transcoding (so your movies end up on your iPhone that much quicker), and also works well as a way of quickly indexing images and video.
Focused on video, Toshiba’s co-processor is also taking advantage of bundled web cams for a limited form of gesture control. Stopping a film by holding up a hand is effective, as is using a clenched fist as an in-air mouse. Bill Gates’ departure reaffirmed his belief in alternative user interfaces, and this is one approach to delivering those new ways of working.
Co-processors aren’t just for flashy graphics. Back in the 1990s I was writing mathematical simulation software, and at one point I had some electro-thermal models running on one of the MOD’s Crays. It wasn’t just any old Cray - it also had a co-processor in the shape of an additional vector processing unit. That vector co-processor made short work of my arrays of partial differential equations. Its direct descendant is a lot closer than an MOD research facility.
In fact, if you’ve got an NVIDIA graphics card it’s right in your PC’s GPU.
Back in January we wrote about Tesla and CUDA, and NVIDIA updated us on the next generation of the Tesla hardware earlier this week. The new G10 Tesla systems are looking very impressive, and the CUDA parallel programming language extensions are now able to work with standard multicore PCs as well as NVIDIA’s GPUs.
Memory is important when you’re using co-processors, and you need a lot if you’re signal-processing seismic data. Tesla will now support 4GB of directly attached memory per GPU, so a quad-GPU system can work with 16GB of data at a time. The numbers look good - and using Folding@home a single Tesla 10 comes in at more than 40 times faster than a standard CPU, and more than 6 times faster than a PS3. Other demonstrations showed significant savings in space and in cost - one finance house has cut its annual costs by a factor of nine, replacing a 600 CPU options valuation system with a handful of front-end CPUs and 12 Tesla GPUs.
Of course with Snow Leopard around the corner, one of the obvious questions was about Apple’s support for OpenCL. It turns out that CUDA is best thought of as a personality layer on top of NVIDIA’s parallel thread execution (PTX) hardware, and it produces device-specific assembly code. There’s no reason why other GPU programming environments can’t produce the same PTX code - but CUDA will remain NVIDIA’s own route to the GPU as a processing tool, and it will be adding support for additional languages beyond C and C++ with Fortran just around the corner.
The future of the co-processor seems assured, for now at least. It’s time for software companies to start taking notice and to deliver on the promise of additional power beyond the CPU.
–Simon
CUDA - let the GPU take the strain
By Simon Bisson & Mary Branscombe in Editorial
Posted in Processors, Silicon, Applications, Business, Server on
The barracuda is the wolf of the sea, a slim silver dart that hunts in deadly packs. It’s perhaps not surprising that NVIDIA has taken part of its name for its GPU-based supercomputing tools.
On a recent trip to the US, Mary and I met up with some of the folk behind CUDA at NVIDIA’s Sunnyvale headquarters. It was a fascinating conversation - if only because I used to write scientific computing software, and something like CUDA would have sped up my work massively. When a problem takes days to solve, using something like CUDA to accelerate processing makes a lot of sense.
Prior to CUDA, NVIDIA had tried to use GPUs for compute, but had run into architectural problems. Things changed with their series 8 GPU, which was very different to anything they’d built before, being designed for compute as well as graphics. That’s led to some tradeoffs - there’s silicon on the GPUs that’s unused when it’s used as an accelerator (and vice versa). However, NVIDIA makes so many chips that there’s not really any financial issue; it all comes out of the economies of scale.
CUDA is more than just a set of chips - it’s a language framework for working with GPUs that can handle both sequential and parallel code together. Developers don’t need to learn anything new, and the framework gives programmers explicit - and simple - interfaces for running parallel code on NVIDIA’s GPUs. There is a long term goal of providing tools for automating parallelism, but at this point you still need to work out what code can be parallelised yourself. The result is very simple, much shorter code, as CUDA handles repetitive calculations for you.
Simplicity comes from the hardware as well, as it manages threads for you. All you need to do is define the tasks the GPU will handle, and manage their interactions. The GPU then runs the calculations over the data, with groups of processors working on different functions at the same time. As RAM is directly attached to the GPU there’s no need to use the PC’s own memory for caching data.
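To give a feel for that division of labour, the sketch below mimics the shape of a kernel launch in plain Python. It is not CUDA and it doesn’t touch a GPU: `saxpy_kernel` and `launch` are made-up stand-ins, and the loop that plays the part of the hardware scheduler runs the “threads” one after another rather than in parallel.

```python
def saxpy_kernel(block_idx, thread_idx, block_dim, a, x, y, out):
    """Work done by ONE thread: compute a single element of a*x + y."""
    i = block_idx * block_dim + thread_idx   # global index of this thread
    if i < len(x):                           # guard: the last block may overhang the data
        out[i] = a * x[i] + y[i]


def launch(kernel, n_blocks, block_dim, *args):
    """Stand-in for the hardware scheduler: run the kernel once per (block, thread) pair.

    A GPU would run these in parallel; this loop runs them one after another.
    """
    for block_idx in range(n_blocks):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


n = 10
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n

block_dim = 4                                  # threads per block (illustrative value)
n_blocks = (n + block_dim - 1) // block_dim    # round up so every element is covered
launch(saxpy_kernel, n_blocks, block_dim, 2.0, x, y, out)
print(out)
```

The programmer writes only the per-element task and picks how many threads to ask for; nothing in the task itself says when or where each thread runs, which is the part the hardware takes care of.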
The numbers coming out of CUDA are impressive. Working with the VMD/NAMD molecular dynamics tools researchers at the University of Illinois have seen a 240X speed-up in the VMD ion placement tool, and an 8 to 12X speed up in NAMD. With an eye on greener computing, they’re also finding that CUDA gives them 1W/Gflop!
If you want this sort of power for your applications (and it’s remarkably suitable for large financial applications) you can buy NVIDIA’s Tesla systems. There are workstation versions, along with deskside offload processors. However the version we were most impressed with comes as a 1U rack mount unit, containing 4 GPUs. Connected to a PC or a server via 5 Gbps PCI-Express connections, this is the way to give your data centre applications a significant speed up, with significantly lower power requirements.
While Tesla may not yet meet NVIDIA’s aim of providing a Teraflop in a 1U unit, it certainly speeds things up. Oxford University researchers have used it to get a 149X speed-up in LIBOR risk analysis, for an 89X improvement in performance/Watt. That’s a good deal in anyone’s book - especially if you’re working with today’s fractious financial markets.
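Those two Oxford figures imply a third. A quick piece of arithmetic, assuming both numbers compare the same workload against the same CPU baseline (`implied_power_ratio` is just an illustrative helper):

```python
def implied_power_ratio(speedup, perf_per_watt_gain):
    """perf/W gain = speedup / power ratio, so power ratio = speedup / perf/W gain."""
    return speedup / perf_per_watt_gain


# Figures quoted above: 149X faster, 89X better performance/Watt.
power_ratio = implied_power_ratio(149, 89)

# Energy per job = power x time; the job runs 149 times faster.
energy_per_job = power_ratio / 149

print(f"power draw x{power_ratio:.2f}, energy per job 1/{1 / energy_per_job:.0f}")
```

In other words, on that assumption the accelerated system draws roughly two-thirds more power while it’s running, but finishes so much sooner that each run uses about 1/89th of the energy.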
Add one to my list for the IT Santa!
–Simon








