Analytics get distributed, parallel and mathematical
By Simon Bisson & Mary Branscombe in Editorial
Posted in data warehouses, analytics, Applications, Storage on
We had a very interesting conversation today, talking about the next generation of business analytics with folk from Greenplum. The most interesting piece of their story was just how their application works with data.
With no legacy to build on, the Greenplum engineers could take a very different architectural approach. Traditional databases use a single store, and a single query engine. Greenplum’s tools break data up into parcels, sharing it across every machine in their data processing network. A central server keeps track of where the data is held, and manages queries - which can be broken up and delivered to the appropriate servers, the results being assembled by the controller. Supercomputer aficionados will immediately spot that Greenplum are using a shared-nothing approach, where queries can run in parallel on sections of the data - speeding things up considerably. Having a master controller handling scheduling means you can even use unmatched hardware for your data servers.
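To make the scatter-and-gather idea concrete, here is a minimal Python sketch of a shared-nothing query. It is not Greenplum's code: the segment count, the `customer` distribution key, and every function name are assumptions made for illustration, and threads stand in for separate data servers.

```python
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

NUM_SEGMENTS = 4  # hypothetical number of data servers


def segment_for(key: str) -> int:
    """Pick a segment by hashing the (assumed) distribution key."""
    return crc32(key.encode("utf-8")) % NUM_SEGMENTS


def distribute(rows):
    """Break the data into parcels, one list of rows per segment."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for row in rows:
        segments[segment_for(row["customer"])].append(row)
    return segments


def local_total(segment_rows):
    """Work done on one segment: a partial SUM(amount) per customer."""
    totals = {}
    for row in segment_rows:
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
    return totals


def run_query(segments):
    """Controller: send the same sub-query to every segment, then merge."""
    with ThreadPoolExecutor(max_workers=NUM_SEGMENTS) as pool:
        partials = list(pool.map(local_total, segments))
    merged = {}
    for partial in partials:
        for customer, total in partial.items():
            merged[customer] = merged.get(customer, 0) + total
    return merged


sales = [
    {"customer": "acme", "amount": 120},
    {"customer": "globex", "amount": 75},
    {"customer": "acme", "amount": 30},
    {"customer": "initech", "amount": 50},
]
segments = distribute(sales)
totals = run_query(segments)
```

Each segment only ever touches its own rows, which is what lets the per-segment work run in parallel; the controller's merge step is the only place the partial results meet.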
Complex joins can be handled in a similar manner, with queries moving data between servers and assembling results on many different processors. With quad-core processors a commodity, and six and eight cores following close behind, it’s not going to be difficult to build a powerful data processing farm (and use the same hardware for other tasks when you don’t need high level analytics).
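One common way to run a join across a shared-nothing cluster is to redistribute both tables by the join key, so that matching rows land on the same server and each server can join locally. The Python sketch below illustrates that general technique; whether Greenplum plans its joins exactly this way is an assumption, as are the table contents and function names.

```python
from zlib import crc32

NUM_SEGMENTS = 3  # hypothetical number of data servers


def redistribute(rows, key):
    """Move each row to the segment chosen by hashing its join key."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for row in rows:
        target = crc32(str(row[key]).encode("utf-8")) % NUM_SEGMENTS
        segments[target].append(row)
    return segments


def local_hash_join(left_rows, right_rows, key):
    """Join the rows held on one segment using an in-memory hash table."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left_rows:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


def distributed_join(left, right, key):
    """Redistribute both inputs, join segment by segment, gather the results."""
    left_segments = redistribute(left, key)
    right_segments = redistribute(right, key)
    result = []
    for left_rows, right_rows in zip(left_segments, right_segments):
        result.extend(local_hash_join(left_rows, right_rows, key))
    return result


orders = [
    {"customer_id": 1, "amount": 120},
    {"customer_id": 2, "amount": 75},
    {"customer_id": 1, "amount": 30},
]
customers = [
    {"customer_id": 1, "name": "acme"},
    {"customer_id": 2, "name": "globex"},
    {"customer_id": 3, "name": "initech"},
]
joined = distributed_join(orders, customers, "customer_id")
```

Because both tables are hashed with the same function, rows that share a `customer_id` always end up on the same segment, so no segment needs to see another segment's data to complete its part of the join.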
There’s another spin out of the architecture - you can mix different query types in one analytic operation. With Greenplum’s tools you can mix SQL with Google’s MapReduce, and even throw in the R statistical language for complex mathematical operations. Modelling is an important piece of business analytics, and being able to run models alongside queries means that Greenplum’s tools are able to compete with high-end analytical tools like SAS. There are plenty of interesting use cases here - perhaps you’re currently working with massive data sets that take a week to process and a day or so to feed into predictive models. With Greenplum you’ll be able to load the data in parallel and run your statistical models on the data - giving you a considerable speed advantage.
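As a rough illustration of why statistical models parallelise well on this kind of architecture, the Python sketch below fits a straight line by having each segment reduce its own data to a handful of sums, which the controller then combines. Python stands in for R here, and the data, function names and the choice of simple least-squares regression are all assumptions for the sake of the example, not a description of Greenplum's implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sums(points):
    """Per-segment step: reduce local (x, y) pairs to five sums."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return n, sx, sy, sxy, sxx


def fit_line(segments):
    """Controller step: add up the partial sums and solve for the line."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_sums, segments))
    n, sx, sy, sxy, sxx = (sum(column) for column in zip(*partials))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical data already spread across three segments.
segments = [
    [(1.0, 3.0), (2.0, 5.0)],
    [(3.0, 7.0)],
    [(4.0, 9.0), (5.0, 11.0)],
]
slope, intercept = fit_line(segments)
```

Only five numbers per segment travel back to the controller, however many rows each segment holds, so the expensive pass over the data happens where the data lives.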
Moore’s Law has hit the wall. Intel’s spectacular U-turn showed as much, as clock speeds dropped and the number of cores went up. That’s left software developers with something of a challenge -

