HP Labs takes on “big data” with innovative archiving system




Senior researcher Craig Soules

Businesses are creating and storing unstructured digital files by the billions – from x-rays to PowerPoint presentations to scanned-in photographs. But for all these files to be useful, people need the ability to find them quickly and easily.

Researchers from HP Labs, in collaboration with technologists from HP Storage, have developed an innovative data archive system to solve this mounting challenge. They combined HP Storage’s highly scalable IBRIX file system with a scalable database developed in HP Labs. With enhanced auditing capabilities and indexed, searchable metadata, the IBRIX Archive provides a large-scale content archiving system that allows organizations to quickly find and analyze files in their data archives.

THE BASICS

The IBRIX Archive combines the IBRIX file system with HP Labs’ scalable database by keeping the data stored in the file system synchronized with the metadata, stored in the database, that describes it. To understand how this works, let’s look at how data lives in each of these two systems.

A file system stores data, also referred to as data objects. In the case of a hospital, the data object might be an x-ray. A database stores the metadata associated with that x-ray, such as the patient’s name or the doctor who created it. When a new data object is entered into a file system, corresponding metadata must be created and entered in the database too. The IBRIX Archive manages this synchronization of data objects and metadata.

To understand this further, let’s expand on the example of an x-ray. A hospital might store x-ray images for tens of thousands of patients in its data archive. Each x-ray could be tagged with multiple metadata fields, such as the date the x-ray was taken, the patient’s name, condition, and health insurance provider, and the name of his or her doctor.

If a hospital administrator wanted to see all the x-rays done by a certain doctor on patients with a certain condition, for example, the IBRIX Archive system could quickly search the metadata and pull up the correct files.
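
To make the idea of a metadata search concrete, here is a minimal Python sketch of the kind of lookup described above. It is an illustration only, not the IBRIX Archive interface: the record type, its field names, the file paths, and the `find_xrays` helper are all hypothetical, and a real archive would query an indexed database rather than scan a list in memory.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class XrayRecord:
    """Hypothetical metadata row; field names are illustrative only."""
    path: str        # where the image lives in the file system
    taken_on: date   # date the x-ray was taken
    patient: str
    condition: str
    insurer: str
    doctor: str


def find_xrays(records, doctor, condition):
    """Return the file paths whose metadata matches both doctor and condition."""
    return [r.path for r in records
            if r.doctor == doctor and r.condition == condition]


# Made-up sample metadata for three archived x-rays.
records = [
    XrayRecord("/archive/xray/0001.dcm", date(2011, 3, 2),
               "A. Jones", "fracture", "Acme Health", "Dr. Lee"),
    XrayRecord("/archive/xray/0002.dcm", date(2011, 3, 5),
               "B. Smith", "pneumonia", "Acme Health", "Dr. Lee"),
    XrayRecord("/archive/xray/0003.dcm", date(2011, 4, 1),
               "C. Brown", "fracture", "Other Mutual", "Dr. Patel"),
]

# Only the metadata is searched; the images themselves are never opened.
matches = find_xrays(records, doctor="Dr. Lee", condition="fracture")
print(matches)
```

The search touches only the small metadata records and returns the locations of the matching files, which is why it can stay fast even when the files themselves are large.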

INNOVATION

The innovation in IBRIX Archive is its ability to improve the speed of metadata searches by utilizing the high-performance updates available in HP Labs’ database technology. To appreciate this innovation, some background on databases is helpful.

An online transaction processing (OLTP) database, a common type of database that supports transaction-oriented applications, doesn’t allow queries against its metadata while that metadata is being updated with new information. This often results in slower database searches.

Data warehouses have solved this problem by using two separate database technologies. This involves the use of one OLTP database to handle updates to metadata, and a second, read-only database that loads the data from the first database on a periodic basis so it can be searched. Unfortunately, this approach provides only “snapshots” of data, often resulting in data that is hours old.

More recently, we’ve seen the emergence of “eventually consistent” databases, which improve the speed of searches on changing data by sacrificing the freshness of data provided by an OLTP database.

The database system developed by HP Labs researchers solves this problem by using a hybrid approach in which transactions are delayed and then processed in batches. This improves the performance of database updates, while also enhancing the freshness of the data.
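
The batching idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the HP Labs design: the `BatchedStore` class, its `batch_size` parameter, and the dictionary that stands in for the database are all hypothetical.

```python
class BatchedStore:
    """Toy key-value store that delays updates and applies them in batches."""

    def __init__(self, batch_size=3):
        self.batch_size = batch_size  # hypothetical tuning knob
        self.pending = []             # updates accepted but not yet applied
        self.applied = {}             # state that queries would see
        self.batches_applied = 0

    def update(self, key, value):
        """Accept an update cheaply; apply it later as part of a batch."""
        self.pending.append((key, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Apply every pending update in a single pass."""
        if not self.pending:
            return
        self.applied.update(self.pending)  # later updates to a key win
        self.pending.clear()
        self.batches_applied += 1


store = BatchedStore(batch_size=3)
for i in range(7):
    store.update(f"file{i}", {"size_mb": i})

# Seven updates with a batch size of three: two full batches, one left pending.
print(store.batches_applied, len(store.applied), len(store.pending))
```

In this sketch, accepting an update is a cheap append and the more expensive work is done once per batch. The price is that the last few updates are not yet visible in `applied`, which is the freshness tradeoff the article goes on to describe.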

The database system also lets users set a threshold for data staleness, with fresher results coming at the cost of query speed. Users can run a quick query that excludes the most recently updated metadata, or a slower query that searches everything.

Skipping a database’s most recent entries is faster because of the database’s pipelined design. When the database receives updates, it must first sort them in order to improve query times. Queries that include recent updates must therefore scan all the not-yet-sorted metadata in order to return a result. Queries against stale data, on the other hand, only need to look up the sorted and indexed metadata.
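
The difference between the two kinds of query can be illustrated with a small Python sketch. It assumes a simplified two-stage pipeline, a sorted list for older entries and an unsorted list for recent ones; the `PipelinedIndex` class and its method names are hypothetical and are not taken from the HP Labs system.

```python
import bisect


class PipelinedIndex:
    """Toy index: older entries are sorted, recent entries are not yet sorted."""

    def __init__(self):
        self.sorted_keys = []  # older entries, kept sorted for binary search
        self.recent = []       # newest entries, in arrival order

    def add(self, key):
        """Accept an update without sorting it yet."""
        self.recent.append(key)

    def merge(self):
        """Sort the recent entries and fold them into the sorted stage."""
        self.sorted_keys = sorted(self.sorted_keys + self.recent)
        self.recent = []

    def count(self, key, include_recent=False):
        """Count entries equal to key.

        The fast path binary-searches the sorted stage only. Including
        recent entries adds a linear scan over the unsorted stage.
        """
        lo = bisect.bisect_left(self.sorted_keys, key)
        hi = bisect.bisect_right(self.sorted_keys, key)
        total = hi - lo
        if include_recent:
            total += sum(1 for k in self.recent if k == key)
        return total


idx = PipelinedIndex()
for kind in ["xray", "mri", "xray"]:
    idx.add(kind)
idx.merge()        # these three entries are now sorted
idx.add("mri")     # two newer entries have arrived but are not sorted yet
idx.add("mri")

fast = idx.count("mri")                       # sorted stage only
full = idx.count("mri", include_recent=True)  # sorted stage plus recent scan
print(fast, full)
```

The fast query returns a slightly stale count because it ignores the two unsorted entries, while the full query pays for a scan of the unsorted stage to include them.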

So, referring to the x-ray example again, if a hospital system administrator wants to find out if MRIs are outpacing x-rays and how that will affect storage needs, a quick database search of the metadata can reveal the answer. If you’re analyzing a five-year trend, you don’t need to include the last 10 minutes’ worth of data. “You can make this tradeoff between freshness and performance,” said Craig Soules, a senior researcher at HP Labs.

The IBRIX Archive system works especially well for an archive, since organizations use archives to store data they won’t need right away.

BENEFITS

HP’s IBRIX Archive allows users to have the best of both worlds when managing vast quantities of archived data: fast queries most of the time, and slower, more up-to-date ones when they need them.

“There are times when you need to conduct up-to-the-second queries,” Soules said. “Our archiving system allows that.”

By combining HP Storage’s highly scalable IBRIX file system with HP Labs’ scalable database, the IBRIX Archive can scale up to hundreds of machines, which is important because you can’t put 100 million files on a single machine, Soules said.

FUTURE

Businesses in many industries are building archives of unstructured data that they’ll need to both search and analyze. Whether it’s a life sciences company archiving data in a genomic database or an entertainment company storing a film archive, HP’s IBRIX Archive system offers a powerful solution.


* For more on this, see the report the team wrote for Lloyd’s of London on digital risk.