by Simon Firth
Early last year, HP set out to digitize the entire
80-year print run of TIME magazine, one of the world's
most widely read publications.
The idea was to automate the process as much as
possible, going far beyond what had previously been
achieved in the field.
In both technical and organizational terms, the
job presented an enormously complex challenge. The
archive was not only physically huge -- running to
more than half a million pages -- it also changed
in character over time, making the material to be
digitized something of a moving target.
To accomplish the task, HP forged an unusual collaboration
between its Consulting and Integration (C&I)
group and HP Labs in an undertaking that required
both groups to go beyond their usual areas of strength
and put everyone under intense pressure to deliver.
At times, recalls HP Labs’ Program Manager
Giuliano Di Vitantonio, “it was absolutely
crazy. There were weeks when we just didn’t
sleep.”
By the end of the year, however, through a mixture
of technological innovation and disciplined process
management, the TIME team had processed the entire
archive with almost 100 percent accuracy.
The resulting collection, housed at TIME.com, provides
a vast online record of history in the making from
1923 to today.
The archive contains each issue's cover, photos
and more than 266,000 original articles, covering
everything from the Great Depression to the Beatles'
U.S. debut (headlined "The Unbarbershopped Quartet")
to the formal birth of the United Nations in 1945
to the race to map the human genome.
Typically, the only way to digitize a magazine archive
has been to retype it by hand.
Applying Optical Character Recognition (OCR) software
might seem a good technological alternative. But
state-of-the-art OCR software is only 99.5 percent
accurate, and it lacks the intelligence to reassemble
the blocks of text it recognizes into complete articles.
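To see why 99.5 percent falls short at this scale, consider a rough back-of-the-envelope sketch. It assumes the accuracy figure is measured per character and that a page holds about 3,000 characters. Both are illustrative assumptions, not figures from HP; only the half-million page count comes from the archive itself.

```python
def expected_errors(pages, chars_per_page, errors_per_thousand):
    """Estimate misread characters using integer arithmetic.

    errors_per_thousand=5 corresponds to 99.5 percent accuracy.
    """
    total_chars = pages * chars_per_page
    return total_chars * errors_per_thousand // 1000


# Assumed values: 3,000 characters per page is a guess for illustration.
per_page = expected_errors(1, 3000, 5)
whole_archive = expected_errors(500_000, 3000, 5)
print(per_page, whole_archive)
```

Under those assumptions, even a small per-character error rate leaves errors on every page, which is why OCR alone could not replace manual correction.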
Was there a way to apply new technology to the problem?
Building on previous work they’d done for
the MIT Press, researchers at HP Labs came up with
a three-stage solution.
First, they passed each digitally scanned magazine
page through multiple OCR engines and used a series of
algorithms they developed to select the best of the
engines' output.
That increased the OCR accuracy rate beyond what
is typically expected. Yet many errors still had
to be addressed, particularly the connection of articles
across page boundaries.
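The article does not describe the selection algorithms HP Labs developed. As a minimal sketch of the general idea only, the example below combines several engines' output by simple majority vote. The function name `vote_tokens` is hypothetical, and the sketch assumes each engine returns the same number of already-aligned word tokens, which real OCR output rarely does.

```python
from collections import Counter


def vote_tokens(engine_outputs):
    """Pick the most common token at each position across engines.

    engine_outputs: list of token lists, one per OCR engine.
    This is a plain majority vote, not HP's method.
    """
    if not engine_outputs:
        return []
    length = len(engine_outputs[0])
    if any(len(tokens) != length for tokens in engine_outputs):
        raise ValueError("this sketch assumes pre-aligned token lists")
    merged = []
    for candidates in zip(*engine_outputs):
        # On a tie, the token seen first (the earliest engine) wins.
        merged.append(Counter(candidates).most_common(1)[0][0])
    return merged


# Made-up sample output from three hypothetical engines.
outputs = [
    ["The", "archlve", "runs"],
    ["The", "archive", "runs"],
    ["Tbe", "archive", "runs"],
]
print(vote_tokens(outputs))
```

Each engine misreads a different word here, so the vote recovers a line that no single engine got entirely right.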
In a second stage, the Labs team reconstructed the
articles from their constituent parts. They did this
by creating a software engine that could recognize
and exclude sections of each page -- such as advertisements
and photographs -- that were not article text. The
software then made intelligent guesses to determine
the correct sequencing for the text blocks.
In this, says Labs' Di Vitantonio, they managed
to reach 80 percent accuracy. Links between text
sequences were identified with moveable arrows on
a graphic reproduction of the page.
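How the Labs engine made its guesses is not detailed here. The sketch below illustrates the two steps in the simplest possible form: drop zones that are not article text, then order the rest by page, column and vertical position. The `Zone` type, its fields and the sorting rule are all assumptions for illustration, and a naive rule like this would fail on the dynamic layouts of newer issues.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    kind: str      # "text", "ad" or "photo" (hypothetical labels)
    page: int
    column: int
    top: float     # distance from the top of the page
    text: str = ""


def guess_reading_order(zones):
    """Exclude non-text zones, then sort by page, column and position."""
    text_zones = [z for z in zones if z.kind == "text"]
    return sorted(text_zones, key=lambda z: (z.page, z.column, z.top))


# Made-up zones from two pages of a hypothetical issue.
zones = [
    Zone("text", 2, 1, 0.1, "continued"),
    Zone("ad", 1, 2, 0.0),
    Zone("text", 1, 1, 0.5, "second"),
    Zone("text", 1, 1, 0.1, "first"),
    Zone("photo", 2, 1, 0.6),
]
print([z.text for z in guess_reading_order(zones)])
```

A fixed rule of this kind guesses right only for simple layouts, which fits with the need for a manual correction stage afterward.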
Full accuracy came with the third stage, which employed
a tool designed by HP researchers and developed by
C&I consultants. The tool enabled C&I staff to
manually link zones together to recreate the reading
flow of the articles where the software had guessed
wrong.
The challenge of digitizing the TIME archive was
made all the harder by changes in the magazine over
the last eight decades.
Early issues used fonts that no longer existed,
for example, and had pages that were often damaged
and thus hard to read. The newer issues were clearly
printed but often employed dynamic graphic layouts
that made it extremely difficult to distinguish text
elements from photographs or figures.
“Essentially, we were dealing with the history
of modern printing of the last 80 years,” says
Jeff Hager, solution delivery manager for Rich Media
in HP C&I who oversaw the TIME project.
The work was carried out in six phases, each of
which had checkpoints where, if content didn't meet
quality standards, it went back a stage.
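The checkpoint rule can be sketched as a small loop: run a phase, check the result, and step back one phase on failure. The function `run_phases`, the two toy phases and the `max_steps` guard are hypothetical; the article does not say how HP tracked work between phases.

```python
def run_phases(item, phases, max_steps=100):
    """Run (process, passes_check) pairs, stepping back one phase on failure."""
    index = 0
    steps = 0
    while index < len(phases):
        if steps >= max_steps:
            raise RuntimeError("content never passed its checkpoints")
        steps += 1
        process, passes_check = phases[index]
        item = process(item)
        if passes_check(item):
            index += 1                  # checkpoint passed, move on
        else:
            index = max(index - 1, 0)   # send the content back a stage
    return item


def scan(page):
    # Toy phase: each pass counts as one more scan of the page.
    return {**page, "scans": page["scans"] + 1}


def recognize(page):
    # Toy phase: pretend recognition only succeeds after a second scan.
    return {**page, "recognized": page["scans"] >= 2}


phases = [
    (scan, lambda page: True),
    (recognize, lambda page: page["recognized"]),
]
print(run_phases({"scans": 0, "recognized": False}, phases))
```

In this toy run the page fails its second checkpoint once, is rescanned, and then passes.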
Managing that process was a huge logistical challenge.
The original magazines were scanned by C&I in
Bridgewater, New Jersey, and then the resulting TIF
data files were shipped to HP Labs’ Barcelona
Research Office in Spain, home to the HP Labs Digital
Content Remastering Program. Once processed, the
data were shipped back to Bridgewater for manual
correction.
“When you’re dealing with content,” notes
HP Labs researcher John Burns, “it’s
old, it’s messy, it’s dirty, it’s
incomplete. And so you end up running an industrial
operation.”
That’s not the kind of thing HP usually does,
says Burns. HP Labs also took on the unusual role
of running data processing for the TIME job -- a
task that, in this case, took 44 days of uninterrupted
server operation.
“This wasn’t something we transferred
to a division or a business unit and they did it
with our technology,” says Di Vitantonio with
some pride. “We really ran the operations for
the entire volume.”
While Labs was running some of the project’s
operations, C&I was required to do far more software
development than usual, notes C&I’s Jeff
Hager.
“It was a really good collaborative effort,” Hager
says. The project, he says, required everyone on
the team to rapidly customize their solution to meet
demanding customer needs on very short notice. That
wasn’t easy.
“Often, we were trying to figure out what
to do, how to do it, how to build the tools to do
it -- and actually do it all at one time,” he
adds.
“The magic of all this is how all the different
parts of HP worked together to make this happen,” says
Hager’s boss, Douglas McMahon, VP, HP Systems
Solutions and HP’s principal liaison with TIME.
“The relationship between TIME and HP,” McMahon
adds, “led to a solution based on people, process
and technology that could get this job done.”
With the newly digitized content as a prime feature,
TIME.com’s archive went live late last December.
Visitors to TIME.com can now search the entire output
of the magazine from 1923 to the present day, as
well as search for covers and browse stories grouped
by theme.
Access to the archive is free to subscribers; nonsubscribers must pay
a small fee or are limited to article summaries.
HP is now looking into whether its digital content
remastering solution might be offered commercially
to other publishers.
Meanwhile, the HP Labs team is looking for its next
challenge in what researchers call content-driven
computing, exploring opportunities in creating IT-based
solutions for the manipulation of digital content.
“Seeing our research get into a final product
or service consumed by real people is always a great
achievement,” says Di Vitantonio. “That’s
the really exciting aspect of this.”