Information archaeology
By Mary Branscombe
Mimecast’s intermediate format doesn't help with attachment formats but the service does offer what Blake calls ‘minimal amounts of data scraping’ to extract text from documents in 11,000 formats. He believes this is going to be a major problem for many companies – and a costly one.
“We pulled in a load of data that had been stored on tape onto online storage that the customer wanted to mine for BI purposes, and we found out that the biggest part of the whole project was not the £500 million storage array but having to build the software to interpret the data from the mainframe system.”
Brunel University is taking its email archiving in-house, using HP’s Integrated Archive Platform for archiving documents and email, including PST files from user machines. According to Iain Liddell, the policy development manager at Brunel, retrieving documents for a recent police investigation took two weeks and resulted in 800 pages of evidence. “If we’d had the system we are building now, it would be a morning and eight pages of evidence.”
The system works with PDF files and retrieves attachments, but he’s still looking for a solution for keeping CAD and graphics files accessible and he’s following developments at the National Archives. “Building regulation documents may need to be kept for a hundred years – there's an interesting problem.
“At the moment, we do have some physical virtual machines in the estate office which are still running Windows 98. We look forward to finding a solution that will take these out of service and retain access to the CAD drawings.”
The university plans to keep the archive manageable using role-based email. “There is a world of difference between the finance director emailing me about computing policy and the finance director emailing me about my pension. We're starting to build aliases for key staff, so if he emails me about computing policy he will email my alias, which is my job title. If he is emailing me about pensions he will be mailing me by name, and our archive is being built to distinguish between the two. As we roll role-based email out to people, it will become more and more normal for me to think ‘is this to the role or the person?’ Are you saluting the uniform or the person?”
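As a rough illustration of the role-versus-person distinction Liddell describes, the sketch below tags a message by the address it was sent to. The alias table, the addresses and the function name are all hypothetical; the article does not say how Brunel's archive implements this.

```python
from dataclasses import dataclass

# Hypothetical alias table: a role address (the job title) mapped to the
# personal mailbox of whoever currently holds that post.
ROLE_ALIASES = {
    "policy-development-manager@example.org": "jane.doe@example.org",
    "finance-director@example.org": "john.smith@example.org",
}


@dataclass
class ArchiveDecision:
    category: str    # "role" or "personal"
    deliver_to: str  # mailbox that actually receives the message


def classify_recipient(address: str) -> ArchiveDecision:
    """Decide whether a message was sent to the role or to the person."""
    addr = address.strip().lower()
    if addr in ROLE_ALIASES:
        # Sent to the job title: archive it as business correspondence.
        return ArchiveDecision("role", ROLE_ALIASES[addr])
    # Sent to a named individual: treat it as personal correspondence.
    return ArchiveDecision("personal", addr)
```

Under these assumptions, mail to the job-title alias is still delivered to the post-holder but is filed as "role" mail, while mail to the person's own address is filed as "personal".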
To make that work, he doesn't plan to rely on people remembering what address to use. “We're looking to develop a probabilistic system that will look at an email or a document the way that the anti-spam software looks at messages and say ‘this looks like a building regulation, we've got to give this 100 years; this is about a student so we keep it for six years after graduation’.”
Although full natural language parsing is still a research problem, he predicts it will be possible to classify using technical terminology. “We hope to be using simple dictionary-based systems in the coming months. We've already done it in terms of incoming mail, so we could quarantine something that wasn't spam but was not to be distributed.” The university will still need the new data centre it’s built. “This won’t reduce the amount of storage we need, it will just take us slightly longer to fill it up.”
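The dictionary-based classification Liddell mentions could look something like the following minimal sketch. The term lists, labels and function name are assumptions made for illustration; only the two retention periods (100 years for building regulations, six years after graduation for student records) come from the article.

```python
from typing import Optional, Tuple

# Hypothetical term dictionaries. Each rule is (label, retention in years,
# terms to look for). The student period counts from graduation.
RETENTION_RULES = [
    ("building_regulation", 100,
     {"building regulation", "planning consent", "structural survey"}),
    ("student_record", 6,
     {"student", "graduation", "transcript"}),
]


def classify_retention(text: str) -> Optional[Tuple[str, int]]:
    """Return (label, years) for the rule with the most term hits, or None."""
    lowered = text.lower()
    best: Optional[Tuple[str, int]] = None
    best_hits = 0
    for label, years, terms in RETENTION_RULES:
        # Plain substring matching: crude, but enough for a dictionary lookup.
        hits = sum(1 for term in terms if term in lowered)
        if hits > best_hits:  # on a tie, the earlier rule wins
            best, best_hits = (label, years), hits
    return best
```

A document that matches no dictionary returns `None`, which a real system would presumably route to a default policy or to a person for review.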

Information Software
Information archaeology & Software 12 Jan 2009
I found your article on the problems of maintaining documents very relevant. I recently had a computer failure. Despite maintaining many back-up copies of my many documents, it has taken me 5 months to get back to some form of working. Some of my records are now in serious trouble. Although I am into designing and manufacturing equipment, I don’t rely on that work for a living; if I did, I would now be out of business.
The problem I have had is as you describe in your article: the incompatibility of old software. I have been able, with the help of others and forums, to recover 10 years of e-mails. But my biggest remaining problem is with the hundreds of drawings I have done over the years. These drawings are part of many manufacturing documents and other ongoing records. Having had to upgrade from Microsoft Windows 98 to Vista, I now find that I can no longer work with all those hundreds of important drawings.
For long-term compatibility, I thought I had been using a very good, simple and quick drawing tool available within most Microsoft operating systems, namely “Microsoft Drawing 1.01”. I now find that Microsoft no longer supports it and that all these old drawings are now in serious trouble; I need them to continue for 20 to 30 years. I have eventually been able to get various software working so that I can see and print these documents with the drawings in the MS Word and Excel documents, but I can’t edit or update them. I am going to have to redraw many of them, if I can find suitable, simple drawing software that can do that work and remain readable and editable for many years.
The software industry needs to realise that we are on this planet and, hopefully, will be around for many years to come. Our information and records must be able to stay with us for hundreds or thousands of years, not just 5 or 6 years. What is going to happen to all our human knowledge when that asteroid does hit Earth? We can still find and read old Egyptian and Mayan records, and many records in much of the ancient world, but what about all our records since computers came into being? What if it takes 10 years to get electricity working again?
But the main point is that present day records should continue to be workable for 50 – 100 years, not dead within 5 years as is presently happening.
By Ip_HughLeytondcd on Tuesday Jan 13