Big Data in Cancer Research

In my first post, I mentioned that I am participating in the Australian Breakthrough Cancer (ABC) Study. The ABC Study started in 2014 and will be working with over 50,000 Australians to investigate the causes of cancer and other diseases. The study will look at the role that our genes, lifestyle and environment play in the development of cancer, and it is hoping for breakthroughs that will enable more individualized prevention and screening, and better targeted public health messages.

I first heard about the ABC Study on the radio in late 2016 and immediately decided to sign up to participate in the research, as I have seen more and more family and friends affected by cancer over the years.

The ABC Study asks questions about lifestyle, habits and the environment we live in. The questionnaire does take a little bit of time to complete as it is quite extensive. It includes questions on the standard risk factors such as smoking, obesity, alcohol and diet, as well as less obvious factors such as your childhood environment, medications, and health conditions.

Answers to the questionnaires are pooled together, and the ABC Study will look for trends and compare people who are later diagnosed with cancer (or other diseases) with those who are unaffected. It also looks at DNA from saliva (and selected blood samples), and the frequency of cancers in families. Check out the “ABC Study” for more details.

Participating in this study started me thinking about how much data is collected for cancer research.

Big Data

The concept of Big Data has arisen from the rapid expansion of data in the world, and it can mean different things to different people. But basically, Big Data is data that is collected from the many sensors in the internet of things, posts on social media sites, digital pictures and videos, retail purchase transaction records, cell phone GPS signals, electricity meters and survey questionnaires, just to name a few.

It’s not the amount of data that is important for Big Data, but what organizations do with it. Big Data is analyzed to reveal patterns, trends, and associations that assist with decision making, such as minimizing the risk of fraud in banking, dealing with traffic congestion for governments, marketing to consumers in retail, and improving patient care in healthcare, also just to name a few.

We all know that the data in the world is growing significantly, with most companies dealing with terabytes or petabytes of data within their organizations today. So how much data is actually being created in the world today, and can we really calculate it?

Whilst no one could accurately calculate the amount of data in the world, IBM estimates that the world creates 2.5 quintillion bytes of data, or 2.5 exabytes (EB), every day, and IDC estimates that in 2012 we created 2.8 zettabytes (ZB) of data, and forecasts that we will generate more than 40 ZB per year by 2020.

But what are exabytes and zettabytes?

So before I go further, if you think a yottabyte is something Yoda would say in Star Wars, here are the data storage terms in decimal. Humans tend to talk in decimal to make it simpler, although computers count in binary, where each step up is a factor of 1,024 instead of 1,000. That explains why, if you purchase a 4 TB USB disk and plug it into your laptop, your computer will only see 3.638 TB.

TB to EB
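For anyone who wants to check the numbers, here is a minimal Python sketch of the decimal and binary conversion described above. The function names (`decimal_bytes`, `as_binary_tb`) are my own, made up for illustration, and the only inputs are the 4 TB disk and the 2.5 EB per day figure already mentioned.

```python
# Decimal storage units: each step up is a factor of 1,000.
DECIMAL_UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def decimal_bytes(value, unit):
    """Return the number of bytes in `value` decimal units, e.g. (4, "TB")."""
    power = DECIMAL_UNITS.index(unit) + 1
    return value * 1000 ** power


def as_binary_tb(num_bytes):
    """Express a byte count in binary terabytes (1,024**4 bytes each)."""
    return num_bytes / 1024 ** 4


# A disk sold as 4 TB (decimal) holds 4,000,000,000,000 bytes.
disk_bytes = decimal_bytes(4, "TB")
print(f"4 TB disk = {disk_bytes:,} bytes")
print(f"The computer reports about {as_binary_tb(disk_bytes):.3f} TB")

# 2.5 quintillion bytes is 2.5 x 10**18, which is 2.5 EB in decimal terms.
daily_bytes = decimal_bytes(2.5, "EB")
print(f"2.5 EB per day = {daily_bytes:.1e} bytes")
```

The gap between the two counts grows with every step up the scale, because the 1,000 versus 1,024 difference is applied once per unit.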

Cancer Research

Cancer affects millions of people around the world, and according to the World Health Organization, 8.8 million people worldwide died from cancer in 2015. That is nearly 1 in 6 of all global deaths.

In the search for a cure for cancer, there are many research studies worldwide that are looking at Big Data to assist, and it is now widely agreed that the way ahead in cancer research is with genomics research.

The human genome is made up of approximately 3 billion base pairs of DNA, and if printed out, the letters in your genome would:

  • Fill 200 500-page telephone directories
  • Take a century to recite, if we recited at one letter per second for 24 hours a day
  • Extend 3,000 km if each letter was 1 mm apart
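The three comparisons above are simple arithmetic on the 3 billion figure, and a rough Python sketch makes them easy to check. The variable names are mine, and the letters-per-page number is only what the telephone directory comparison implies, not something I have measured.

```python
BASE_PAIRS = 3_000_000_000  # approximate letters in the human genome

# Reciting one letter per second, 24 hours a day.
SECONDS_PER_YEAR = 60 * 60 * 24 * 365
years_to_recite = BASE_PAIRS / SECONDS_PER_YEAR

# One letter every millimetre: there are 1,000,000 mm in a kilometre.
length_km = BASE_PAIRS / 1_000_000

# 200 directories of 500 pages each implies this many letters per page.
pages = 200 * 500
letters_per_page = BASE_PAIRS / pages

print(f"Years to recite: {years_to_recite:.1f}")
print(f"Length at 1 mm per letter: {length_km:,.0f} km")
print(f"Implied letters per page: {letters_per_page:,.0f}")
```

So "a century" is a slight round-up of roughly 95 years, while the 3,000 km figure comes out exactly.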

According to Precision Medicine’s article “How big is the human genome”, the estimated storage capacity required to store one person’s genome (in a perfect world) is 700 megabytes (MB), but in the real world, right off the genome sequencer, it is estimated to be 200 gigabytes (GB). But only about 0.1% of the genome is different among individuals (FYI, we are all 99.9% the same), which equates to about 3 million variants (aka mutations) in the average human genome. If we can make a “diff file” with just the list of variations, the data capacity requirement is estimated at only 125 MB.
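As a rough sanity check on those figures, here is a small Python sketch. The 2 bits per letter encoding is my own assumption (four possible letters fit in 2 bits), not something stated above, and the bytes-per-variant number is simply the 125 MB estimate divided by the 3 million variants.

```python
BASE_PAIRS = 3_000_000_000

# 0.1% of the genome differs between individuals (1 letter in 1,000).
variants = BASE_PAIRS // 1000

# Assumption: each letter (A, C, G or T) is stored in 2 bits.
BITS_PER_LETTER = 2
genome_bytes = BASE_PAIRS * BITS_PER_LETTER / 8
two_bit_mb = genome_bytes / 1000 ** 2    # decimal megabytes
two_bit_mib = genome_bytes / 1024 ** 2   # binary megabytes

# What the 125 MB "diff file" estimate implies per variant.
DIFF_FILE_BYTES = 125 * 1000 ** 2
bytes_per_variant = DIFF_FILE_BYTES / variants

print(f"Variants per genome: {variants:,}")
print(f"2-bit genome: {two_bit_mb:.0f} MB decimal, {two_bit_mib:.0f} MB binary")
print(f"Diff file: about {bytes_per_variant:.1f} bytes per variant")
```

Under that assumption the "perfect world" size lands in the same range as the 700 MB estimate, which suggests the estimate is counting a compact encoding of the letters and nothing else.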

And, as an aside, if you want to understand the genome a bit better, check out this TED Talk: “How to read the genome and build a human being” by Riccardo Sabatini. I found it very interesting.

So, from a data perspective, what does this all mean? If storing one person’s DNA straight off the genome sequencer takes 200 GB, then storing 50,000 genomes for a cancer study takes 10 petabytes (PB). But if we are able to store just the diff file, it’s only about 6 TB. The reality is that the capacity will be somewhere in between, depending on what is saved.
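The study-wide totals are just the per-person estimates multiplied by 50,000, and a short Python sketch shows the working. The variable names are mine, and the per-person sizes are the estimates quoted earlier in this post, in decimal units.

```python
PARTICIPANTS = 50_000

RAW_GB_PER_PERSON = 200     # straight off the genome sequencer
IDEAL_MB_PER_PERSON = 700   # "perfect world" whole genome
DIFF_MB_PER_PERSON = 125    # list of variations only

# Decimal units: 1 PB = 1,000,000 GB and 1 TB = 1,000,000 MB.
raw_pb = PARTICIPANTS * RAW_GB_PER_PERSON / 1_000_000
ideal_tb = PARTICIPANTS * IDEAL_MB_PER_PERSON / 1_000_000
diff_tb = PARTICIPANTS * DIFF_MB_PER_PERSON / 1_000_000

print(f"Raw sequencer output: {raw_pb:g} PB")
print(f"Perfect-world genomes: {ideal_tb:g} TB")
print(f"Diff files only: {diff_tb:g} TB")
```

The diff file total works out to 6.25 TB, and the "perfect world" total of 35 TB is one example of the "somewhere in between" case.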

So, what’s in the future?

It’s interesting to look at the data storage requirements for healthcare as we continue to invest in research to find a cure for cancer. The data requirements today will potentially be petabytes per study, but the research requirements will likely start extending to exabytes, if not zettabytes, of data in the future.

However, it’s not the capacity of the data that’s important, but rather what we can do with the data. As we progress with genome research, we will be able to use DNA to track the progression of cancers and their response to treatment in real time. Blood tests are also being developed that could detect cancer, and locate where in the body the tumor is growing, without having to do invasive surgical procedures like biopsies.

And, at the rate of technology development, imagine a day in the future when you (or your children) walk into the doctor not feeling well. A simple swab of DNA is analyzed and a few minutes later the illness is known. The analysis also identifies other health issues before there are any symptoms, and a treatment plan with a pharmacy script is electronically sent to your health app on your phone, and your national health record is automatically updated.

Whilst this may seem a bit far-fetched, it’s not that long ago that we watched TV shows with Maxwell Smart talking on a shoe phone, Dick Tracy making calls on his watch, and the Jetsons with their 3D TVs, video phone calls, and instant food (a.k.a. 3D food printing). All of those things seemed quite bizarre at the time!

2 thoughts on “Big Data in Cancer Research”

  1. Great article Chris that puts some perspective on Big Data. At the moment it really comes down to volume and time: how much information can you store and process? At the scale at which this is going, we will soon see a dramatic disruption in the technology, maybe Quantum Computing?