A Gem of a Pattern
Researchers use computer science and math to discover a cancer gene
Picture a string of 12,625 numbers, with each number representing a gene. Then imagine 257 of those long, numerical strings lined up, one beneath the other, so that you have a massive mosaic of numbers. Now, take a close look at that mosaic and see if you can find any discernible patterns in the numbers that correlate with specific human traits. Impossible? Almost.
Essentially, that was the challenge that Cheryl Willman, M.D., director of the UNM Cancer Research and Treatment Center, and her team put to University of New Mexico Computer Science professors Paul Helman and Bob Veroff. Not only did Helman and Veroff meet the challenge, they also made an important discovery that may help save the lives of children with leukemia.
In 2000, a team of medical researchers at the UNM School of Medicine collected microarrays - slides imprinted with DNA chips - from 257 children with an aggressive form of cancer called Acute Lymphoblastic Leukemia (ALL). Each microarray contained 12,625 "probe sets." Each probe set essentially measured how strongly a single gene was expressed in the patient. However, the School of Medicine had no effective way to analyze the microarray data because of the volume and complexity of the information.
A Perfect Match
"We immediately saw that this problem matched our research interest," says Veroff. He and Helman had worked together since the 1980s when, as new faculty members at UNM, they had written a textbook together. Helman's background is in data mining and machine learning, while Veroff's is in automated deduction. Together they had been researching statistical machine learning, a process that looks for patterns in data using a mathematical model called a Bayesian network, or "Bayesian net." The model is based on a probability theorem developed by Thomas Bayes, an 18th-century mathematician and theologian.
Veroff explains that Bayesian nets can be used to find meaningful patterns in data, and to disregard spurious ones, by using known information about the data to direct the next step of the analysis. "One of the things we do with Bayesian nets is to come up with ways to use as much of the information that is available to us about a problem, to help us make the decision as to what we should look at next," says Veroff.
In 2001, Helman and Veroff started customizing a Bayesian network to handle the huge volume of data from the medical school's microarrays. Helman helps frame the scope of the challenge by noting that there were more pattern possibilities than there are atoms in the universe. "The number of possibilities in the data from the medical school was immense. So they needed an efficient way to look at the data. We tailored a Bayesian net for the specific characteristics of this problem. We started with a very large data set from the patients. And we had information on whether or not each patient survived his or her leukemia. The way it works is that you try to find patterns in the genes that correlate with long-term survival," explains Helman.
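The idea Helman describes, looking for gene patterns that correlate with survival, can be illustrated at its very simplest with one gene and a handful of patients. The sketch below is only a toy under stated assumptions: the records are invented for illustration, the name survival_given_expression is hypothetical, and plain counting stands in for the far richer Bayesian net the team actually built.

```python
from collections import Counter

# Hypothetical toy records: (gene_is_high, patient_survived).
# These values are invented for illustration; they are NOT the UNM data.
records = [
    (True, True), (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False), (False, False),
]

def survival_given_expression(records):
    """Estimate P(survived | gene high) and P(survived | gene low) by counting."""
    totals = Counter()      # patients seen at each expression level
    survivors = Counter()   # survivors seen at each expression level
    for gene_is_high, survived in records:
        totals[gene_is_high] += 1
        if survived:
            survivors[gene_is_high] += 1
    return {level: survivors[level] / totals[level] for level in totals}

estimates = survival_given_expression(records)
print("P(survived | gene high) =", estimates[True])
print("P(survived | gene low)  =", estimates[False])
```

A gene whose two estimates differ sharply is a candidate predictor. With 12,625 probe sets and patterns that can involve more than one gene, the number of such comparisons grows very quickly, which is why an efficient search mattered.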
After the team tested the network and standardized the data from the microarrays, Helman and Veroff ran the data through the Bayesian net on a number of computers in the Center for High Performance Computing (HPC).
Finding A Pattern
By the summer of 2002, the process had found a pattern - and a medical discovery. "Amazingly, our Bayesian net revealed that there was one particular gene that was extremely predictive of whether or not someone would survive their leukemia. When that gene was 'expressed' - or turned on - the patient had an extremely high probability of surviving their leukemia. When it was low, the probability wasn't as good," explains Helman.
The gene is named Outcome Predictor for Acute Leukemia 1, or OPAL1, and its discovery is making waves in the cancer research field. The team first presented its findings at a meeting of the American Society of Hematology in late 2003. The group has been invited to submit its research for publication in the New England Journal of Medicine.
OPAL1 has also been submitted for patenting. Most importantly, the discovery has implications for determining appropriate treatment for leukemia and may someday lead to new treatment protocols for patients with ALL based on their OPAL1 gene expression.
As with most important finds, the OPAL1 discovery depended on collaboration. "It took quite a lot of time and commitment trying to understand the things the medical school was doing, and for them to understand the kinds of things we were talking about," says Veroff.
Dr. Joseph L. Cecchi, dean of the School of Engineering, agrees. "This collaboration between the Computer Science Department in the School of Engineering and the Cancer Research and Treatment Center in the School of Medicine was so successful because the faculty from each school took the time to learn about each other’s 'culture,' to bridge the gaps and solve this important problem."
Unique Analysis Approach
Helman and Veroff's blending of classical mathematical theories and computer science is unique. While many researchers are looking at microarrays, few are applying Bayesian nets to analyze the data as the UNM team does. Helman says, "The new computer science aspect that Bob and I developed and applied for the gene work is: A) how do you know which of the patterns to pursue most rigorously when you can only pursue a small number of them because of the sheer enormity of the number? And, B) how do you allow for the fact that there are so many patterns and you don't want to be fooled by ones that are just random associations? And how confident can you be that the ones that look good are actually meaningful? Those are the kinds of new computer science techniques that we have to apply to these kinds of problems."
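Helman's point about being fooled by random associations can be seen in a small simulation. The sketch below is an illustration only, not the team's technique: it generates purely random "genes" and purely random outcomes for 257 hypothetical patients, then reports the largest survival gap that turns up by chance. The function names, the 2,000-gene count and the coin-flip data are all assumptions made for the example.

```python
import random

def survival_gap(gene, outcomes):
    """Absolute difference in survival rate between high and low expression groups."""
    high = [o for g, o in zip(gene, outcomes) if g]
    low = [o for g, o in zip(gene, outcomes) if not g]
    if not high or not low:
        return 0.0  # no comparison is possible if one group is empty
    return abs(sum(high) / len(high) - sum(low) / len(low))

def chance_gaps(n_genes, n_patients, rng):
    """Return (largest gap, median gap) among genes unrelated to the outcome."""
    # Coin-flip outcomes: by construction, no gene can truly predict them.
    outcomes = [rng.random() < 0.5 for _ in range(n_patients)]
    gaps = []
    for _ in range(n_genes):
        gene = [rng.random() < 0.5 for _ in range(n_patients)]
        gaps.append(survival_gap(gene, outcomes))
    gaps.sort()
    return gaps[-1], gaps[len(gaps) // 2]

rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
best_gap, median_gap = chance_gaps(n_genes=2000, n_patients=257, rng=rng)
print("largest gap found by chance:", round(best_gap, 3))
print("median gap:", round(median_gap, 3))
```

Even though nothing here is truly predictive, the best-looking random gene shows a much larger gap than a typical one. Any method that searches many patterns needs a way to tell such flukes from meaningful signals, which is the question Helman raises.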
The potential for combining Bayesian nets and computer science is wide-ranging. The team has already customized the networks for a variety of applications. They have looked for the usage patterns of people trying to hack computers and are attempting to analyze export patterns to detect proliferation activities. Currently, they are working on a defense-related application that evaluates biosignatures of toxins like anthrax. The hope is that the research could lead to rapid field assessment to determine whether people have been infected with toxins, viruses or bacteria. They are also funded by the National Science Foundation to study DNA damage response mechanisms.
Classical mathematical theories, algorithms and computer science all seem removed from the human condition. They are not, says Helman. "What we learn here is all transferable to humans." OPAL1 is perfect proof.