“DeNovo Origin of Human Protein-Coding Genes” or How Some New Genes Come About.


  • “DeNovo Origin of Human Protein-Coding Genes” or How Some New Genes Come About.

    Hi All,


    

I read another paper and think I understand it reasonably well, which made it an interesting paper.

    And so I have the urge to share.


    The origin of new genes is an important topic in evolutionary biology. While there are several different mechanisms behind the formation of new genes, which will be described soon, in the following set of posts, I will be discussing this:-

    De Novo Origin of Human Protein Coding Genes

    - paper, which deals with a mechanism of new gene formation which is only just being appreciated.



    In essence, this process causes new genes to form from the vast stretches of DNA lying between the protein coding genes. Protein coding genes are those which code for functional protein, and the stretches of DNA between them are often called “junk” because these regions seem to have little to no function. However, scientists are beginning to find evidence that they may be regions from which new genes can arise.



    To find examples of these denovo genes, researchers have to trawl massive databases looking for strings of DNA in humans that closely resemble strings of DNA in close relatives of humans (chimpanzees and orangutans), but with one difference: the string can be translated in humans, while the matching strings in chimps and orangutans cannot be translated and do not code for any identifiable protein. This means that in the human lineage, mutations had occurred in the DNA to produce a translatable sequence, complete with a translation ‘start here’ codon. Once DNA is translated into polypeptide it is at least open to selection and is potentially functional; at best that polypeptide is a fully functional protein, and the underlying DNA is an actual gene.



    (The reader needs to remember that the genome of an organism consists of massively long strings of DNA, made up of a sugar-phosphate backbone to which bases, labelled A, C, T or G, are attached. Sets of three bases define an amino acid, strings of which go to make up polypeptides or proteins. To get from DNA to protein, the cellular machinery first transcribes DNA into messenger RNA, which is very much like DNA, and the messenger RNA is then translated into strings of amino acids called polypeptides. Polypeptides which are functional are what we call proteins. To refresh your memories, and possibly confuse you even more, there is this which adds a bit of technical detail to my very sloppy description.)

    Because humans, chimps and orangutans share common ancestry, these similar strings of DNA were passed on via the process of common descent. The common ancestor had the string and it was passed on to the lineages that became chimp, orangutan and humans, subsequently diverging in each lineage thanks to the accumulation of point mutations. But they did not diverge enough such that all traces of common ancestry were gone. That they were not translated in chimps and orangutans, but were translated in humans meant that in humans these strips had mutated to the point of potentially being functional in humans.
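The idea of strings that have diverged by point mutations, but not so far that the shared ancestry is lost, can be put into numbers. What follows is a minimal sketch only: the two sequences are made up for illustration (they are not data from the paper), the function name is my own, and it assumes the sequences have already been lined up base for base.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two pre-aligned, equal-length sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must already be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Hypothetical toy sequences: two point differences out of fifteen bases.
human = "ATGGCTTGGAAACCA"
chimp = "ATAGCTTGAAAACCA"
print(round(percent_identity(human, chimp), 1))  # 86.7
```

In this toy pair the two strings are still obviously related, yet one of the two differences removes the ATG and the other turns a codon into a stop (TGA), which is the kind of small change that separates a translatable string from a non translatable one.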


    Finding these comparable strips of DNA proved to be a challenge and the number of them suggested to the researchers that denovo origination of new genes from the non coding regions of a genome may be more frequent than had previously been thought.



    Because I think I understand the above paper reasonably well, in the next set of posts I will describe what the paper reports, firming up my own understanding and passing on to the reader something of the idea regarding a means by which new genes are thought to arise in organisms.



    This has been my introduction. In the next post I discuss the paper’s introduction.






    To be continued ....

    Last edited by rwatts; 09-11-2014, 11:41 PM.

  • #2
    Great information!

    De novo genes are going to be difficult for the anti-evolutionists to explain.

    I predict that some baraminologist will simply expand the notion of baramin to include "de novo within kind."

    Of course then the Gitt-Fernandez treatment of "information" will need to be modified. Fuel for another book I suppose.

    K54



    • #3
      Originally posted by klaus54 View Post
      Great information!

      De novo genes are going to be difficult for the anti-evolutionists to explain.

      I predict that some baraminologist will simply expand the notion of baramin to include "de novo within kind."

      Of course then the Gitt-Fernandez treatment of "information" will need to be modified. Fuel for another book I suppose.

      K54
      I do think Jorge should pop in for a discussion and tell us about "evolution" and "Evolution". Then I can tell him about "meteorology" and "Meteorology" where the latter shows the clash between meteorological science and the clear word of the Bible. Ironically, when it comes to "Meteorology", creationists, as far as I can tell, are compromisers and accommodationists, things they do not like Theistic Evolutionists for.



      • #4
        The Gitt-Fernandez notion of "information" will definitely have to be expanded so that de novo genes cannot (by definition!) increase it.

        Perhaps they will argue that the de novo genes are really not de novo, just like "junk" DNA really isn't "junk" (you stupid Darwinists!!!).

        There's one thing certain. ID/anti-evolutionism delta(entropy)>=0 and is = zero if and only if one posits that the scientific clock has stopped, say around 1960.

        Why? Because increasing knowledge of genetics results in more microstates that need explained. ergo the ID Entropy (G-F-ID) necessarily increases.

        Maybe I should help Jorge with his new book?

        K54
        Last edited by klaus54; 09-01-2014, 08:15 PM. Reason: typos and fix em ups



        • #5
          INTRODUCTION 1


          Most, if not all science papers I’ve read, have an introduction which is always worth a read. Why is it so good? Well the introduction is generally quite readable by a layperson, providing a lot of background information to the experiment being reported and often this includes a lot of the history of ideas leading up to the experiment.



          Accordingly, one learns much from that one section. And so it is with this paper.

          The authors begin with the most obvious of statements:-

          Originally posted by first link in OP
          The origin of new genes has always been an intriguing evolutionary question [1].
          

Indeed.

          They continue by listing the mechanisms by which new genes can arise. I’ll do so, but offer a brief comment on what each is about:-



          1) Gene duplication. When DNA is copied during cell division, either a portion of the DNA or the whole of it can get duplicated. Naturally, this duplication can include genes, chromosomes and even the whole genome of an organism. One of the duplicates is free to diverge by collecting mutations, while the other maintains the original function.

          2) Exon shuffling. Genes generally are composed of exons and introns, the former being strings of DNA which code for parts of a protein, while the latter are strings of DNA which don’t code for protein and which the cellular machinery snips out of the messenger RNA before protein is made. Sometimes, exons from different genes are brought together and new genes can arise from this process.

          3) Retroposition. Here, bits of RNA get transcribed back into DNA, and in the process add to the DNA, sometimes producing a new functional gene.

          4) Mobile elements. Sometimes known as jumping genes. Jumping genes are bits of DNA that can move around a chromosome, inserting themselves anywhere and thereby, sometimes, causing new genes to form. But there are other kinds of mobile element, for example, plasmids, which are often bits of DNA added to a genome from outside, by a bacterium.

          5) Lateral gene transfer. Also known as horizontal gene transfer. This is when different organisms exchange genes, and is not the same as gene exchange by sexual reproduction. Bacteria, for example, can exchange genes simply by one bacterium injecting a fragment of its DNA into another bacterium with a different set of genes.

          6) Gene fusion/fission. Sometimes different genes can fuse or break apart, thereby bringing about a novel gene.

          7) De novo origination. The topic of this set of posts.

          While mechanisms 1) - 6) have been accepted as the various means by which new genes arise in nature, the seventh process had, for a long time, been considered to be unlikely, if not impossible.


          In the next post, I will explain the reasons for this.



          To be continued ....


          Last edited by rwatts; 09-02-2014, 08:31 PM.



          • #6
            INTRODUCTION 2


            
In the previous post we learnt about the six different mechanisms by which new genes form. A seventh was mentioned, “de novo origination” whereby a gene comes into existence where no gene existed before.

            

It was noted however, that this was deemed by researchers to be impossible at worst and most unlikely at best. In this post we have a look at how this attitude came to be, according to the introduction provided by the paper linked to in the OP.

            By the 1970s the structure of DNA had been resolved and its role in heredity and common descent was basically understood. And scientists were learning how to locate protein coding genes within strands of DNA. Accordingly, questions could begin to be asked as to the sources of variation between organisms and the role this had in evolution. In 1970, a Japanese researcher, Susumu Ohno, wrote a very influential book arguing that new genes came from pre-existing genes by a mechanism called gene duplication. He provided a wealth of evidence to back his claim and in the process argued that a new gene forming from a random sequence would be highly unlikely.



            Hence, not only had an influential researcher said “no” to the denovo origin of genes, but his influential book perhaps side tracked genetics for a bit while people digested his argument and began to focus on gene duplication as a means of evolution.



            Then in 1977 an even more influential scientist, the Nobel Prize winner François Jacob, wrote an article for Science in which he gave a resounding “no” to the idea of denovo origination. His argument came in this paper:-

            Evolution and Tinkering

            

The paper is well worth a read because in it, Jacob presents the case for a mechanism of evolution that is now well appreciated - evolution by tinkering, or exaptation, or co-option, whereby new functions evolve by co-opting existing systems and using them in new ways. Gene duplication is a major factor behind this because it creates a copy of an existing system (gene, set of genes, or gene regulator) and one of the pair can evolve the new function, while the other of the pair maintains the initial function.



            Thus, some thirty to forty years ago, two very influential researchers gave a thumbs down to the idea in books and papers they had written which nevertheless showed plausible ways by which evolution could proceed.



            For some decades then, researchers focused on exploring these means of evolution, and the idea of denovo formation of genes was largely ignored.



            Recently this ignoring and ignorance has begun to fade and in the next post we see why this happened.






            To be continued ....

            Last edited by rwatts; 09-03-2014, 04:01 PM.



            • #7
              INTRODUCTION 3

              In the first post describing the introduction, I briefly discussed how new genes come about, and one of those mechanisms was denovo origination, whereby new genes arose from random sequences of code which lie between the protein coding genes of an organism’s genome. While the other six mechanisms had been well established, this seventh mechanism was largely discounted and I showed why in the following post.

              Now I look at the reasons for this situation changing in recent years.

              First off, the paper mentions two things which must happen for a random sequence of code to begin to become a new gene:-

              Originally posted by first link at OP (rearranged by me)
              (1), the DNA must be transcriptionally active, and
              


              (2) it must evolve a translatable open reading frame.



              However, these two steps can occur in either order.
              

The reason for this is that to produce functional protein, DNA first has to be transcribed into messenger RNA by the cellular machinery. Once transcribed, the messenger RNA then has a chance of being translated into protein. However, to be translated, the messenger RNA needs to have a ‘start here’ point so that the cellular translation machinery can pick it up and begin reading off contiguous sets of three bases (codons) which define the strings of amino acids making up the polypeptide, or protein if the polypeptide turns out to be functional.



              It doesn’t matter whether the active transcription happens first then the translation open reading frame (ORF) mutates into existence next, or whether the translation ORF mutates first then transcription follows later.
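The two steps can be mimicked in a few lines of code. This is a minimal sketch under loud assumptions: the function names and the toy DNA strings are mine, the DNA given is taken to be the coding strand, and translation is taken to begin at the first AUG and end at the first stop codon in the same frame, which is a great simplification of what the cell actually does. The codon table is the standard genetic code.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Messenger RNA carries the coding strand's sequence with U in place of T."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read codons from the first AUG until the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no 'start here' point, so nothing gets translated
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if amino == "*":  # stop codon reached
            break
        peptide.append(amino)
    return "".join(peptide)


print(translate(transcribe("CCATGGCTTGGAAATAACC")))  # MAWK
print(repr(translate(transcribe("CCATAGCTTGG"))))    # '' (no start codon)
```

The second string is transcribed just as readily as the first, but without an AUG it yields no polypeptide, which is why both conditions are needed before a stretch of DNA can begin to become a gene.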

              So why did the scientific community begin to change its mind with respect to the denovo origin of genes? The authors describe a number of key papers that were published. I will list them shortly. I have not read the references to establish the means by which the scientists determined that DNA from non coding regions was in fact being translated into polypeptide. This topic will soon become important in the context of the paper I’m describing here, but whether or not these other researchers used the same methods, I don’t know. Anyway, the reports which changed people’s minds were:-

              

1) In 2006, some 36 years after Ohno’s work on gene duplication, the pioneering research was done and published in this paper - Novel genes derived from noncoding DNA in Drosophila melanogaster are frequently X-linked and exhibit testis-biased expression.

              2) Around the same time another group found newly evolved genes in related species of fruit fly. See reference 7 at the link in the OP.

              3) Over the following years, other groups reported examples of more such genes and in particular the group under Zhou (see reference 10 at the link in the OP) estimated that around 12% of new genes in Drosophila had originated denovo, as opposed to the other ways already discussed. The authors note however, that it is unclear whether or not many of this 12% of genes actually code for functional protein. The polypeptide produced from the ORF may in fact be useless and simply be broken down by the cellular machinery. I presume that this research detected DNA from the non coding regions being transcribed and then translated, without functionality being determined, thereby bringing this 12% into question.

              

4) In 2009, Knowles and colleagues found three possible denovo genes within the human genome. They were labelled CLLU1, c22orf45, and DNAH10OS, and are mentioned here because they become interesting in the context of the research reported in this paper. Knowles et al. used transcription and translation evidence to argue their case and reasoned that perhaps as much as 0.075% of the human gene set of some 25,000 genes arose denovo. This makes for around 18 genes (0.075% of 25,000 is 18.75). Reference 5 in the link at the OP is the relevant paper here.

              5) Then in 2010, Li and colleagues identified another human denovo gene, this one labelled C20orf203 and which was associated with brain function. This last point is also interesting in the context of the research I describe here, because many of the genes found by this report also are associated with the brain. See reference 11 in the link at the OP.



              6) And so the pace has picked up, with denovo genes being identified in Saccharomyces cerevisiae (baker’s yeast), the mouse, a species of the malaria parasite, and rice over succeeding years - references 12 to 16 in the link at the OP.



              Despite all of the above evidence, people continue to think that the denovo origin of genes is very rare. Yet, some of the above research suggests otherwise, and so the rate of formation of these genes remains very much unknown.

              The authors hypothesized that, notwithstanding gene duplication, denovo gene formation should be reasonably frequent and so they went looking. By comparing DNA and protein databases for humans and closely related primates in a very conservative search, they uncovered 60 denovo originated protein coding genes specific to humans. They mustered both transcriptional and proteomic evidence to support their finding, that is, evidence for these genes was found in both DNA and protein databases, thereby demonstrating that such an origin for genes is not exactly rare and certainly not impossible.


              In the next post we begin to examine how they uncovered this evidence, in the process learning what open reading frames are, and why they are important.



              To be continued ...

              Last edited by rwatts; 09-05-2014, 04:02 PM.



              • #8
                I quickly scanned over the posts between you (rwatts) and Santa "Buffoon" Klaus54.
                [Seems to be a duet which I thought I'd interrupt]

                I was like ............ more bedaffled than Daffy Duck!

                Listening to you guys expressing your beliefs in this nonsense actually hurts my brain.
                I am not being sarcastic, dramatic or trying to be funny - it actually has that effect.

                The biggest thing that gets me is the amount of blind faith that you people exhibit
                without realizing, and much less acknowledging, that you are indeed exercising HUGE
                amounts of faith (not to mention ignorance and selective adaptation).

                Then, to top things off, your FAITH prevents you from seeing / considering anything
                that isn't in agreement with that faith. It's as if your few working neurons automatically
                shut down the instant that they perceive a threat to your Materialistic beliefs.

                Truth be told, I pity you - I truly do!

                P.S. This is my one and only post in this thread - no need for anything more.

                Jorge



                • #9
                  Did that rant have something to do with de novo genes??

                  K54



                  • #10
                    A rant is all you have, isn't it Jorge.



                    • #11
                      Originally posted by Jorge View Post
                      P.S. This is my one and only post in this thread - no need for anything more.

                      Jorge
                      Coward. Rant then run off.

                      That paper is all about an experiment that was performed. Why not actually criticise the experiment, and the interpretation they placed on the results Jorge? The point is, you can't, can you. Despite all your hubris, you lack the knowledge to be able to offer a reasonable critique.

                      That is what is so silly.

                      You claim to be doing the work of God, yet rant then run is the best you offer.



                      • #12
                        Originally posted by klaus54 View Post
                        Did that rant have something to do with de novo genes??

                        K54
                        A frightened fellow isn't he. Lacks the ability to read and understand the paper and so has no means of arguing against it. Hence the rant, followed by the running away.



                        • #13
                          AN ASIDE. A BIT OF BACKGROUND


                          

Before I actually begin the discussion of their results, I need to take a detour. This research relied on several very important databases and search algorithms, the most important of which I will mention or describe here.

                          

This is my understanding only, so please don’t rely on it to get through any exams. :(

                          

Because of link posting limitations, you will need to Google many of the terms yourselves, if you want more information.

                          The main search algorithm is the Basic Local Alignment Search Tool, otherwise known as BLAST. The algorithm has several different versions: BLASTN, for searching for DNA nucleotide sequences (the strings of adenine A, guanine G, cytosine C and thymine T, or uracil U in RNA); BLASTP, for searching for amino acid sequences of polypeptides or proteins (the strings of serine, arginine, aspartic acid, lysine plus the other 16 amino acids); and BLASTN EST, for searching for expressed sequence tags, or short strips of transcribed DNA indicating associated gene activity. All were used extensively in this study.
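To give a feel for what “local alignment” means, here is a minimal sketch of the textbook Smith-Waterman scoring method. It is not BLAST: BLAST is a much faster heuristic that approximates this kind of search across whole databases. The function name, the scoring values (+2 for a match, -1 for a mismatch, -2 for a gap) and the toy sequences are all my own assumptions for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the table
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            score = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Hypothetical toy sequences sharing the stretch ATGGCT.
print(local_alignment_score("GGGATGGCTAAA", "TTATGGCTCC"))  # 12
```

The score of 12 comes from the six matching bases of the shared stretch; the unrelated bases on either side are simply ignored, which is the “local” part of the name.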



                          Across the globe, some researchers are uncovering the DNA and RNA sequences of various organisms, and this information goes into databases such as Ensembl. However, elsewhere across the planet, other researchers are uncovering the amino acid sequences of the various polypeptides and proteins for different organisms, and these sequences are being placed into other databases such as PRoteomics IDEntifications (PRIDE) or the Peptide Atlas. 



                          And it's generally so that there is no direct mapping between the various nucleotide (DNA/RNA) databases and the various protein databases. This is because a string of bases on DNA does not exactly spell out a string of amino acids for a protein. And the teams of researchers working on DNA are not the same teams working on protein. Two different scientific disciplines in the main. Remember that, for complex organisms at least, most DNA appears to be non functional. Furthermore, genes have a complex structure, being made up of exons and introns, the former specifying the protein, and the latter being snipped out of the messenger RNA (which gets transcribed from the underlying DNA) before the protein is built from it. An amino acid is defined by a set of three contiguous bases (called a codon) in the messenger RNA, which is itself a complementary version of the bases in the DNA.

                          

Because of this, it can be tricky locating the bit of DNA in a massively long string of DNA that is likely to constitute a gene that gives rise to functional protein. And this is made all the harder by the fact that DNA sequences are massive and are being uncovered by the ton, each day. DNA, protein and other data are simply being poured into databases, somewhat like water over the Niagara Falls.



                          This is where lots of computing power comes in.



                          So, given a massively long string of DNA bases (for example ...ATTGACTTAACCTT...) exactly how do scientists begin to find where a possible gene might be?



                          They look for what are called Open Reading Frames (ORFs).

                          

Look at the figure at the link just given, showing three rows of bases with purple “start” codons and red “stop” codons.



                          Remember that with DNA, every set of three bases defines one amino acid in a protein. A specific set of three bases (ATG) also defines the start point for translation, the ‘start here’ codon. Three other sets of three bases (TGA, TAA and TAG) can define the stop point for translation. Transcription has its own, separate start and stop signals: it begins near a promoter upstream of the gene, and its termination point is defined by a complex interaction between the transcription machinery and the structure of the messenger RNA being created by that machinery, including perhaps other proteins. So, the transcription machinery copies a stretch of DNA, base by base, into complementary bases of messenger RNA, and that stretch has to include both the start codon and the stop codon. During translation, which happens shortly after, the protein is built beginning at the start codon, with each set of three bases defining one amino acid, until the stop codon is reached. An ORF is therefore a start codon followed by a run of codons containing no stop codon, ending at the first stop codon in the same frame.

                          

The question for the software is this - how to locate these ORFs? How to locate that first ATG and then the following sequence that contains no stop codons?



                          This is where it gets even trickier for the searching algorithm.

                          

It starts at the beginning of the string of DNA, at the very first base. In the figure at the link just above, on row 1, the algorithm has found the first ATG. Note where the translation stop codon is, in red, to the right. But it may be so that, when the search algorithm, beginning at the first base on the string of DNA, gets down to this point, that A in the first row is actually the last base in the previous codon as opposed to the first base in this codon. This means that the possible start codon is really as shown on row 2, along with the translation stop codon a little bit further downstream. This potential gene is much shorter than the one revealed in the first row. Or it may be so that the A and T shown in row 1 are actually the last two bases in the previous codon, as opposed to the first two bases in this codon. In that case the translation start and stop points are as defined in the third row.



                          Thus, with the search algorithm reading from left to right, three possible gene sequences have been found.



                          However, while genes can be read in the sense (left to right) direction, they can also be read in the anti-sense (right to left) direction. And so the search algorithm has to do the same kind of thing in reverse, producing six potential sequences for analysis.

                          

(Imagine doing this kind of search across a sequence, a billion bases long, a sequence that has just been put into the database, without any knowledge as to which bases constitute genes, regulators, etc. Needle in a haystack stuff.)



                          Given six possible candidates, which sequence is taken as the ORF? Generally it’s the longest one, because genes tend to be long rather than short.



                          Thus, in the figure at the link above, the ORF is that in row 1, from the ATG up to the TAA. It’s the longest and it contains no intervening stop codons as seen on the following two rows.
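
The six-frame search described above can be sketched in a few lines of Python. This is only a toy under my own assumptions, not the software the databases actually use: the function names (`reverse_complement`, `orfs_in_frame`, `longest_orf`) are invented for illustration, the input is assumed to be uppercase A, C, G and T only, and the anti-sense reading is done on the reverse complement of the string.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, written in its own reading direction."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(dna, frame):
    """Yield every ATG...stop stretch when reading in one frame (0, 1 or 2)."""
    start = None
    for pos in range(frame, len(dna) - 2, 3):
        codon = dna[pos:pos + 3]
        if start is None and codon == "ATG":
            start = pos  # first start codon seen in this frame
        elif start is not None and codon in STOP_CODONS:
            yield dna[start:pos + 3]  # include the stop codon
            start = None


def longest_orf(dna):
    """Scan three frames on each strand and return the longest ORF found."""
    candidates = []
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            candidates.extend(orfs_in_frame(strand, frame))
    return max(candidates, key=len, default="")


# Hypothetical example: the ORF sits in the third frame of the forward strand.
print(longest_orf("CCATGAAACCCGGGTAA"))  # prints "ATGAAACCCGGGTAA"
```

Shifting the starting position by one or two bases is what the `frame` argument does, and running the same scan on the reverse complement covers the other three readings.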



                          I think that is how it works.



                          Of course, this does not demonstrate that the actual selection constitutes a bona fide gene. However, from this sequence, amino acids can be established and therefore protein databases searched to determine if associated proteins exist, and the EST database searched to see if associated gene transcription has in fact occurred, and the nucleotide databases can be searched to see if messenger RNA has been transcribed.




                          Therefore, having told you something of my understanding of how genes are often located, in the next post I can return to the paper and begin to explore how they uncovered the evidence for the denovo genes.





                          To be continued ...



                          • #14
                            RESULTS: Search for De-Novo Genes in the Human Lineage 1


                            How did they uncover these denovo genes?

                            Well, I’ll let them explain then offer my own understanding.

                            


                            Originally posted by link at OP
                            We performed a simple, conservative, but systematic pipeline to search for genes that originated de novo in the human genome since divergence from the chimpanzee (Figure 1). All human protein sequences were searched using BLASTP against the protein databases of other primates, i.e. chimpanzee, orangutan, rhesus macaque, and marmoset, with orthologs identified using an E-value threshold of 10^-10. After the BLAST procedure and excluding proteins shorter than 100 amino acids and short protein sequences from alternatively spliced genes, we retrieved 584 genes from the human genome that did not have a hit in other primates. Human sequences that did not have a start (i.e., ATG) or stop codons were excluded and the remaining 352 genes were searched using BLAT against the chimpanzee and orangutan genomes in the UCSC database (http://genome.ucsc.edu/, [19]) to identify orthologous sequences. In addition to the bioinformatic analyses all of the sequences underwent extensive manual checks. Human genes for which an orthologous gene region (i.e., highly similar sequences) could not be identified in the chimpanzee or orangutan were discarded. Genes that had many duplicates in the human genome were also discarded. To be a candidate de novo originated gene, in addition to having a potentially translatable open reading frame in the human genome, the gene must have been present, and disrupted (i.e., non-translatable), in both the chimpanzee and orangutan genomes, e.g., the chimpanzee and orangutan sequences must lack an ATG start codon or have frameshift-inducing indels or nucleotide differences that result in a premature stop codon. Chimpanzee and orangutan sequences lacking only an ATG start codons were searched to determine whether they had alternative start codons, either upstream or downstream of the human ATG that could generate frame complete translatable open reading frames. 
Chimpanzee or orangutan genes that possessed premature stop codons but retained predicted protein lengths longer than 80% of the human proteins were discarded for analysis, while those with predicted proteins that were shorter than 80% of the size of the human proteins were kept for the analysis of human de novo genes (see Dataset S1). To exclude the possibility that the new gene had been generated in the primate ancestor and then lost in parallel in both the chimpanzee and orangutan lineages we searched for human specific mutations that were responsible for generating the completed protein-coding open reading frame. Only those genes that had a human specific mutation that generates an open reading frame and where both the chimpanzee and orangutan retained the ancestral state at these positions, thus disrupting the open-reading frame, were kept (see Dataset S2). These stringent criteria yielded a set of 46 genes. Lastly, the coding sequences of these 46 putative de novo human genes were used as queries in searches of databases for evidence of expression at the mRNA and protein level. Expression at the mRNA level was assessed by BLASTN searches of the NCBI (http://www.ncbi.nlm.nih.gov/) nr (non-redundant) database, to search the corresponding matched expressed mRNA sequence, and the UCSC (http://genome.ucsc.edu/) EST database, to search for short expressed sequence tags. Evidence for the existence of the protein was obtained through searches of two proteomic databases, PRIDE [20] and PeptideAtlas [21] (Dataset S3). The PRIDE and PeptideAtlas databases are composed of peptide sequences derived from proteomic experiments. Searches of these databases resulted in the identification of 27 novel human genes that have matching expressed mRNA sequences in the GenBank or UCSC databases, thus must be transcribed, and also have evidence for being translated as they have matching peptides from the proteomic databases (Table S1). 
The mRNA evidence suggests that none of these human genes have splice variants.
                            Hmmm. The problem is that this is written by experts for experts who actually understand what is going on. It’s not written for the layperson, and there is a lot that is not said. Hence I’m going to describe what I think is going on.

                            The above is encapsulated in figure 1 at the link in the OP and is perhaps worth looking at.

                            As I describe it, hopefully you will see that their search was very conservative. The aim was to find DNA sequences that were similar between humans, chimps and orangutans but which were expressed in humans and not expressed in the other primates, and which also showed conservation between the chimps and orangutans. That is, the common ancestor to humans, chimps and orangutans had the sequence, and when the three primate lineages diverged, that sequence came with each lineage, only evolving into a functional gene in the human line.

                            

First, using BLASTP, they searched the protein databases of chimpanzee, orangutan, rhesus macaque, and marmoset looking for orthologs of the known human proteins. These known proteins came from databases named “Ensembl”. Ensembl itself is an automated system predicting gene locations and allowing them to be tied to proteins, once the latter become identified. They used an “Expect value” (E value) threshold of 10^-10. The E value describes the number of hits of at least that quality a search would expect to see based on chance alone. So they were searching the protein databases of these other primates for orthologs of the human proteins, with an extraordinarily low expectation that any match found was due to chance.



                            From this search they retained the human proteins that were not found on the other databases. That is, they were unique to humans. I presume they began with proteins because, compared to genes, proteins are straightforward strings of amino acids, as opposed to genes which have complex structures. However, this gene set could not yet be taken as the set of denovo genes because the underlying genes could have arisen by duplication or any one of the other six methods described earlier.

                            From this set of unique human proteins they removed very short ones (shorter than 100 amino acids) as well as short proteins derived from alternatively spliced genes.

                            This left them with 584 human genes. They removed from this list those genes that had no identifiable start (ATG) or stop codons, leaving 352 sequences.

                            These sequences became inputs to the BLAST Like Alignment Tool (BLAT), another search algorithm but one that is a lot faster than BLAST. These were run against chimp and orangutan databases, searching for orthologous sequences there.



                            Human genes which had no orthologs in the chimp and orangutan databases were removed from the list. This might be beginning to sound a little bit strange. Didn’t they want sequences unique to humans? Yes, but not completely so. They wanted sequences with orthologs in the other primates because these are shared, derived from the common ancestor. The things to find are those sequences that are switched on and active in humans but silent in the other primates. Also removed were those genes which, in humans, had many duplicates (they probably arose by duplication). Furthermore, the sequence in both the chimp and orangutan databases had to be non translatable because it lacked a start codon or had a premature stop codon. Thus, while the sequence existed on the databases of the other two primates, it could not be translated into protein, meaning that common ancestors between the three groups would have shared the sequence, but it only became functional in the human lineage.

                            Various other discard criteria were applied. For example, chimp or orangutan sequences that lacked the ATG but had a possible alternative start codon were checked to see if they could still be translated, and sequences that contained premature stop codons but still gave rise to predicted proteins longer than 80% of the length of the human protein were removed from the list.

                            Also excluded were sequences that could have evolved in a distant common ancestor, but were lost in parallel in both the chimp and orangutan genomes while being retained in the human genome. They did this by ensuring that the human sequences had the mutations which generated the complete open reading frames, and that these mutations were absent in the other two primates, whose associated sites retained the ancestral state (showing no evolution). This meant that the ORF had actually evolved in the human lineage, not beforehand.



                            This left the researchers with a candidate list of 46 genes.
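
To make the chain of discard criteria easier to follow, here is a loose Python sketch of it. The 100 amino acid and 80% thresholds come from the quoted passage. Everything else is my own assumption for illustration: the class names, the field names and the idea of boiling each check down to a simple flag. The real pipeline relied on BLASTP and BLAT searches plus extensive manual checks, none of which is reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Ortholog:
    """Hypothetical summary of the matching region in another primate."""
    found: bool             # a highly similar region exists in that genome
    translatable: bool      # it has an intact open reading frame
    length_fraction: float  # predicted protein length / human protein length


@dataclass
class Candidate:
    """Hypothetical summary of one human gene going through the screen."""
    name: str
    protein_length: int     # in amino acids
    has_start_and_stop: bool
    has_many_human_duplicates: bool
    chimp: Ortholog
    orangutan: Ortholog
    human_specific_enabling_mutation: bool


def is_de_novo_candidate(c):
    """Apply the discard criteria in roughly the order described above."""
    if c.protein_length < 100 or not c.has_start_and_stop:
        return False
    if c.has_many_human_duplicates:
        return False  # probably arose by duplication
    for ortholog in (c.chimp, c.orangutan):
        if not ortholog.found or ortholog.translatable:
            return False  # must be present but disrupted in both
        if ortholog.length_fraction > 0.8:
            return False  # predicted protein nearly as long as the human one
    # The ORF-creating mutation must be specific to the human lineage.
    return c.human_specific_enabling_mutation


# Made-up example record, not a gene from the paper.
disrupted = Ortholog(found=True, translatable=False, length_fraction=0.3)
example = Candidate("example_gene", 150, True, False, disrupted, disrupted, True)
print(is_de_novo_candidate(example))  # prints True
```

A sequence only survives if it passes every test, which is why the list shrinks so sharply at each stage.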



                            So they had candidate sequences that matched between human, chimp and orangutan. The question was, did these sequences express messenger RNA (the outcome of gene transcription)? So again, additional databases were searched, the NCBI nr and the UCSC EST, for evidence that these sequences were transcribed. Then the protein databases, PRIDE and PeptideAtlas, were searched to see if these sequences generated protein, that is, whether the messenger RNA was translated into protein.

                            The outcome of all this was the 27 novel human genes.

                            These last searches had me puzzled somewhat, given that the 46 genes located previously all came from initial protein searches. However, presumably these final searches closed the loop by demonstrating both transcription and translation. Furthermore, given that Ensembl gene/protein matches were used, presumably there was always the possibility of a “false positive” in the context of uncertain protein identification. So these final searches weeded out any false positives.

                            Importantly, the searches also defined where the genes were expressed, which has bearing on gene functionality, which will be discussed later.

                            


A final thing to bear in mind when talking about sequence matching: matches are detected within a certain threshold. This is because sequences for the same thing (a specific protein or a specific gene) vary between individuals as well as between species. Hence when a match for a specific human sequence is detected on say the orangutan database, it’s generally not a precise match.
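
A very simple way to picture a "match within a threshold" is percent identity between two sequences that have already been lined up. The Python sketch below is only that simple picture, with a function name and example sequences I have made up. Real tools such as BLAST and BLAT also handle gaps and score matches statistically, which this does not attempt.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must already be aligned to equal length")
    if not seq_a:
        return 0.0
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Hypothetical sequences differing at one base out of ten.
human = "ATGGCCTTAC"
other = "ATGGCTTTAC"
print(percent_identity(human, other))  # prints 90.0
```

With a threshold of, say, 85%, these two made-up sequences would count as a match even though they are not identical.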


                            Anyway, the upshot of all these comparisons was that set of 27 putative novel human genes.

                            With this list, the authors go on to discuss their findings and there are one or two surprises.


                            To be continued ...


                            Last edited by rwatts; 09-11-2014, 11:39 PM.



                            • #15
                              RESULTS: Search for De-Novo Genes in the Human Lineage 2




                              In the last post we saw that to locate potential denovo genes, researchers had to compare several databases of sequences for humans, chimps, orang-utans and other primates. Ultimately they were looking for DNA sequences that were similar between humans, chimps and orang-utans, and that were translated to protein in humans but not in the other two primates, while nevertheless being conserved between the chimps and orang-utans.

                              This was evidence that all three lineages inherited the sequence from a common ancestor and only in humans did it evolve to become a functional gene.

                              In locating the 27 denovo sequences in this study, they found that the CLLU1, c22orf45 and DNAH10OS denovo genes positively identified in an earlier study (and mentioned in a previous post), did not show up in this search, despite the fact that the earlier study was very thorough.

                              The researchers of the previous study had used different (earlier) Ensembl databases. Version 46 for the earlier study and version 56 for this study. I’m not entirely sure why this lack of identification happened. However, these databases are constructed from automated sequence searches and presumably changes to the underlying algorithms (as more becomes known) can cause discrepancies. Possibly too, different versions of the databases contain different segments of DNA.

                              c22orf45 and DNAH10OS don’t show up in the 56 version, but do in the older 46 version. CLLU1 shows up in both versions, but failed to show up in the latest PeptideAtlas, (and remember to be counted as denovo in this study, sequences had to pass all screening tests). However, the peptides associated with the c22orf45 and DNAH10OS sequences did show up in the Atlas. In other words, these previously identified genes did not appear in the new study because they failed to pass every rigorous step in the screening process.

                              These discrepancies encouraged the researchers to use their pipeline, but this time beginning with earlier versions of Ensembl prior to version 46. An additional 33 denovo genes were identified. This time only the DNAH10OS from the other study was identified.

                              That these 60 sequences, 27 from the first run and 33 from the alternative run, are bona fide denovo genes is supported by the fact that all 60 sequences were subsequently derived using the same screening system but with the very latest (at the time of writing) version of the human genome.

                              Given the conservation of the sequences between the orang-utan and chimp lineages as well as their non translatability in those lineages but their translatability in the human lineage, the authors argue that these sequences became denovo genes from non coding regions, after the human-chimp split from a common ancestor.

                              The sequences are not found as duplicates on the human genome, indicating that they did not arise via gene duplication. On the other primates, the orthologous sequences occur as single sequences as well, except for one gene which exists as a disrupted duplicate on the orang-utan. How the bases diverge on each sequence for each organism also reinforces the 1 to 1 relationship between the sequences from each organism. All putative denovo genes, bar one, existed as single exons too, as opposed to having a more typical complex gene structure comprising exons and introns. And only one of the genes is located on the sex (X) chromosome. The other genes are associated with other body tissues and organs.

                              Next the researchers needed to determine whether or not these sequences were fixed in the human population. They did this by looking at another database, that containing the human haplotype map (HapMap). The idea is that when a gene comes into existence in an individual within a population, if it is a beneficial gene, it will likely spread through a population as each new generation comes into existence. Eventually it will come to completely dominate a population, at which point it is said to be fixed. If a gene is found in some individuals but not in others, then the gene cannot be said to be fixed and perhaps over time will be removed from a population. Each gene (bar one) was found to be in every human variant, thus showing no example of gene deletion or insertion. Hence the genes were likely fixed. If some human variants were missing a particular denovo sequence say, then the sequence was not fixed, and while it had arisen alright, the possibility remained that at some time in the “near” future it might be removed from the population. In the case of one gene, a polymorphism was found whereby the gene had a mutation causing a premature stop codon, suggesting that it was not (yet) fixed.

                              The upshot was that 59 of the 60 denovo genes were fixed in the human population, leading the researchers to suggest that denovo gene origination and subsequent fixation is not exactly rare. Put it this way, in the context of the earlier claims by Ohno and Jacob, denovo origination was quite possible. 

Since the human/chimp split, an estimated 10 to 12 genes per million years have been “bubbling” into existence from the DNA “junk” regions.
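
The arithmetic behind that rate is easy to check. The sketch below assumes the human/chimp split happened about 5 to 6 million years ago. That range is my assumption for the purpose of the sum and is not stated in the posts above. The function name is also just for illustration.

```python
def genes_per_million_years(gene_count, split_age_myr):
    """Average number of new genes per million years since the split."""
    return gene_count / split_age_myr


de_novo_genes = 27 + 33  # first run plus the run on the earlier Ensembl versions

# Assumed ages of the human/chimp split, in millions of years.
for split_age_myr in (5, 6):
    rate = genes_per_million_years(de_novo_genes, split_age_myr)
    print(f"{split_age_myr} million years: {rate:.0f} genes per million years")
```

With 60 genes, a split 5 million years ago gives 12 per million years and a split 6 million years ago gives 10, which is where the 10 to 12 range would come from under that assumption.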


                              So the next question was - what were these new genes doing?



                              To be continued ….
