Research Guides/April 27, 2026/6 min read

Wait, Genes Only Make up 2% of Total DNA?

A review on protein-coding vs non-coding DNA

Xander Turco

Protein Coding DNA, Repetitive DNA, History

Typically, when we think about biology and genetics, we go straight to genes. This is what every elementary and high school biology course in North America teaches us. As humans, we have around 20,000 well-established protein-coding genes. However, humans are much more complex than just their genes, and I am here to tell you how in the world that is possible.

Let’s First Define a Gene

Biologists typically refer to a gene as a piece of DNA that contains the information required to produce a specific protein that plays a biological role in the cell. We refer to these as protein-coding genes. They sit at the heart of a core principle in biology called the Central Dogma, which states that genes are transcribed into mRNA and subsequently translated into proteins, which fold into stable structures to perform specific tasks.
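To make the two steps concrete, here is a minimal sketch in Python of transcription followed by translation. It assumes the input is the coding strand of a toy gene (the 12-letter sequence is invented for illustration), uses the standard genetic code, and ignores real-world details such as introns, splicing, and untranslated regions. The function names are my own, not from any library.

```python
from itertools import product

# Standard genetic code, listed in UCAG order for each codon position.
# "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """Transcription (simplified): the mRNA matches the coding strand, with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translation (simplified): read codons from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so no protein in this toy model
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


toy_gene = "ATGGCCTGGTAA"  # hypothetical sequence, for illustration only
mrna = transcribe(toy_gene)
print(mrna)             # AUGGCCUGGUAA
print(translate(mrna))  # MAW
```

Each letter in the output stands for one amino acid (M for methionine, A for alanine, W for tryptophan), which is the chain that would go on to fold into a protein.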

Mechanistically, this is a nice and tidy explanation for many relevant biological processes, which is one of the reasons scientists have been so drawn to these genes. Another reason protein-coding genes have received so much attention is that they are easier and cheaper to sequence. After all, it is a lot easier to sequence 20,000 genes than it is to sequence an entire genome.

Now Let’s Define a Genome

As I mentioned, the human genome contains around 20,000 protein-coding genes, which account for a mere 1–2% of our entire genetic makeup. Our genome consists of about 3.3 billion nucleotides. Yes, that is essentially a book with 3.3 billion letters, written using only four characters: A, T, C, and G.
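For a sense of scale, the arithmetic behind those figures can be sketched in a few lines of Python. The genome size and the 1–2% range are the numbers quoted above; the storage estimate is my own back-of-the-envelope addition, assuming each of the four letters is stored in 2 bits with no compression or metadata.

```python
GENOME_NUCLEOTIDES = 3_300_000_000   # figure quoted in this article
CODING_FRACTION_RANGE = (0.01, 0.02)  # the "1-2%" that is protein-coding


def coding_nucleotides(total: int, fraction: float) -> int:
    """Number of nucleotides that fall in protein-coding sequence."""
    return round(total * fraction)


low, high = (coding_nucleotides(GENOME_NUCLEOTIDES, f) for f in CODING_FRACTION_RANGE)

# Four possible letters (A, T, C, G) need 2 bits each (assumption: no compression).
BITS_PER_LETTER = 2
megabytes = GENOME_NUCLEOTIDES * BITS_PER_LETTER / 8 / 1_000_000

print(f"Protein-coding: {low:,} to {high:,} nucleotides")
print(f"Non-coding: at least {GENOME_NUCLEOTIDES - high:,} nucleotides")
print(f"Raw storage at 2 bits per letter: {megabytes:.0f} MB")
```

In other words, the protein-coding portion is on the order of tens of millions of letters, while the rest of the book runs to more than three billion.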

So, what is the other 98% of human DNA?

Historically, scientists referred to this 98% as junk DNA. Many thought it was unlikely that this portion of the genome could be serving a functional role. These regions were considered redundant and free from selective pressures, allowing mutations to accumulate without damaging the organism. However, many of these regions are now recognized as important regulators of gene expression and genome organization [1].

If This Region Is Not Protein-Coding, How Is It Relevant?

We now know that protein-coding sequences alone cannot fully explain most genetic diseases [1]. This is why the field of genetics has shifted from studying the role of single genes to studying entire systems and how different regions of the genome interact with one another.

Non-coding elements can have diverse roles in the regulation of protein-coding genes. Broadly speaking, they include cis-regulatory regions and non-coding RNAs [2].

Cis-regulatory regions include promoters and distal elements, such as enhancers, silencers, and insulators. These regions regulate gene expression through the binding of transcription factors.
Non-coding RNAs can be divided into several categories, including tRNAs, rRNAs, small nucleolar RNAs (snoRNAs), small nuclear RNAs (snRNAs), miRNAs, and long non-coding RNAs (lncRNAs), which are greater than 200 nucleotides in length [2]. These RNAs act through different mechanisms to modulate gene expression, and many are known to play important roles in cancer biology.

Additionally, many non-coding sequences are repetitive, and some of these repeats are transposable, meaning they can move around the genome. These elements can facilitate genomic rearrangements and have played important roles in evolution [1].

If They Are So Important, Why Are They So Understudied?

I do not mean to take away from any of the incredible advancements that have come from studying protein-coding genes. In fact, I should lay out some of the major advances that have come from studying these regions [3]:

CFTR modulators, such as Trikafta and Kalydeco, act as personalized medicines for cystic fibrosis by helping correct the defective protein produced by specific mutations.
Targeted therapies and immunotherapies in oncology, such as monoclonal antibodies against HER2, which is overexpressed in HER2-positive breast cancer, have revolutionized treatment.
Osimertinib, which targets EGFR-mutated non-small cell lung cancer, has drastically improved disease-free survival.
In advanced melanoma, ipilimumab, a monoclonal antibody directed against cytotoxic T-lymphocyte antigen 4 (CTLA-4), has improved survival.

These advancements have revolutionized how we treat cancer and rare diseases. But I cannot help but feel like we are still missing something: the other 98%. If we can gain this much insight from only 2% of the genome, imagine the possibilities as we begin to decode the remaining 98%.

So far, I may have made it sound like researchers simply dismissed these regions, but that is not entirely true. Much of the reluctance to study this 98% of the genome stems from deep-rooted technical challenges tied to the complexity of these regions. Genomic technologies still struggle with repetitive regions of non-coding DNA, because a short sequencing read taken from a repeat can match many places in the genome equally well. Newer approaches, such as long-read sequencing, are still relatively recent compared with traditional sequencing technologies. As a result, many researchers have historically been cautious about trusting data generated from these parts of the genome.

Scientists in the repetitive DNA community have made a strong case for not masking away repeats in genomics studies [4], and I am on their side. As sequencing technologies and computational tools continue to improve, I believe these regions will become increasingly important for understanding human biology and disease.
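For readers unfamiliar with the term, here is a minimal sketch of what masking does to a sequence. It assumes the common convention in which soft-masked repeats are written in lowercase and hard-masked repeats are replaced with the letter N. The toy sequence and the function names are my own inventions for illustration, not output from any real tool.

```python
def masked_fraction(soft_masked_seq: str) -> float:
    """Fraction of bases flagged as repeats (lowercase) in a soft-masked sequence."""
    bases = [b for b in soft_masked_seq if b.isalpha()]
    if not bases:
        return 0.0
    return sum(b.islower() for b in bases) / len(bases)


def hard_mask(soft_masked_seq: str) -> str:
    """Replace every soft-masked (lowercase) base with N, hiding it from later analysis."""
    return "".join("N" if b.islower() else b for b in soft_masked_seq)


# Hypothetical sequence: unique DNA in uppercase, a tandem repeat in lowercase.
toy = "ACGTGCAttagttagttagGCTA"

print(f"{masked_fraction(toy):.0%} of this toy sequence is flagged as repeat")
print(hard_mask(toy))
```

Once the repeat has been turned into a run of N characters, downstream tools can no longer see what was there, which is exactly the information the repetitive DNA community argues we should keep.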

Final Thoughts

As the cost of genetic sequencing continues to decrease, healthcare is moving toward not just treating symptoms, but understanding the underlying mechanisms associated with disease. Genome sequencing is pushing healthcare toward a more personalized approach. It is already being used to diagnose cancer and rare diseases, assess how certain treatments may affect individual patients, and screen for abnormalities in newborn children.

In my opinion, there may come a point when you go for your next physical with your doctor and, instead of pulling up scattered notes, they pull up your genetic sequence. From there, they may use it to assess your overall health, identify potential risks, and determine how to prescribe medications in a way that is best suited to you.

References

[1] Ruffo, P., Traynor, B. J. & Conforti, F. L. Unveiling the regulatory potential of the non-coding genome: Insights from the human genome project to precision medicine. Genes Dis. 12, 101652 (2025). https://pmc.ncbi.nlm.nih.gov/articles/PMC12355918/

[2] Khurana, E. et al. Role of non-coding sequence variants in cancer. Nat. Rev. Genet. 17, 93–108 (2016). https://www.nature.com/articles/nrg.2015.17#Sec3

[3] Khan, A. et al. Genomic medicine and personalized treatment: a narrative review. Ann. Med. Surg. (Lond.) 87, 1406–1414 (2025). https://pmc.ncbi.nlm.nih.gov/articles/PMC11981433/

[4] Slotkin, R. K. The case for not masking away repetitive DNA. Mob. DNA 9, 15 (2018). https://pmc.ncbi.nlm.nih.gov/articles/PMC5930866/

[5] Images generated using ChatGPT, Apr 26 2026.