Previously on archetyp.al, I demonstrated a bioinformatics use for Hadoop MapReduce. The idea was to build on the ubiquitous word count example, using a problem that is at least somewhat relevant to bioinformatics: read in a VCF file, parse out the reference and variant bases, and collect an overall count of the mutation spectrum. So here we are, back at it with an Apache Spark version of the demo. Check out my listing of useful Spark links for a direction to go after this demo.
Why Spark? There are a lot of reasons to go with Spark instead of MapReduce, but for me the most convincing reason is time. In this case the MapReduce solution written in Java takes north of 100 lines of code to set up. Granted, some of that is taken up parsing VCF lines for details which go unused, but 100 is a fair count. Now have a look at the Spark solution:
What is this code doing? If you are familiar with Scala, it accomplishes roughly what it looks like it’s doing, though via different means. It starts by defining a singleton object which contains the main function to be invoked upon execution. The SparkContext object is used to set up the Spark job. Then the input file is loaded from the first command line parameter, args(0). All header lines, those which begin with a ‘#’, are filtered out.
The refVar = ... line is where I drop the first three columns and take only the next 2 columns, which are the “reference” and “variant” columns, respectively. I feel like there’s a better way to select arbitrary columns to keep than the take/drop combination. If you have a suggestion, please leave a comment. Then the collection of ref and variant columns is filtered for length == 1, because we want to remove any indels or structural variants.
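To make that column selection concrete, here is the same idea in plain Python (an illustrative sketch, not the Scala code; the sample lines are made up and the field positions assume a standard tab-delimited VCF, where REF and ALT are the fourth and fifth columns):

```python
# Keep only the REF and ALT columns (0-indexed 3 and 4 of a VCF row),
# then drop anything that isn't a single base (indels, structural variants).
def ref_alt(line):
    cols = line.split("\t")
    return cols[3], cols[4]  # REF, ALT

lines = [
    "22\t16050075\t.\tA\tG\t.\tPASS\t.",
    "22\t16050115\t.\tG\tGTT\t.\tPASS\t.",  # insertion, filtered out below
]
pairs = [ref_alt(l) for l in lines if not l.startswith("#")]
snvs = [(r, a) for r, a in pairs if len(r) == 1 and len(a) == 1]
```

The length check plays the same role as the Scala filter for length == 1.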
Line 12 is really where the magic happens. This is the step that is explicitly a MapReduce operation. The map function is provided a lambda (or anonymous function), which is run on each member of the refVarNoIndel collection. This lambda function simply concatenates the two columns, ref and var, into a single string “ref -> var” and then constructs a tuple with that string as the first value, and the number 1 as the second value.
Then, following immediately after the map operation, and on the same line no less, is the reduceByKey function. For those unfamiliar with Scala, the odd-looking lambda (_ + _) is shorthand for (a, b) => a + b. This defines how the reduce function combines the values it receives from the map function: it adds each data point to a running total by key, and returns those keys with their respective totals.
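For readers who prefer Python, here is a rough analogue of the map followed by reduceByKey(_ + _), using a plain dictionary as the running total (the pairs list is made-up sample data):

```python
from collections import defaultdict

# Map step: emit ("ref -> alt", 1) for each single-nucleotide variant.
pairs = [("A", "G"), ("C", "T"), ("A", "G")]
mapped = [(f"{ref} -> {alt}", 1) for ref, alt in pairs]

# Reduce step: sum the 1s per key, just like reduceByKey(_ + _).
counts = defaultdict(int)
for key, n in mapped:
    counts[key] += n
```

The difference, of course, is that Spark performs the reduction in parallel across partitions rather than in one loop.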
Line 13 then sorts the results and pushes them out to the ./answer directory.
And that’s pretty much everything. There’s an sbt file, which I lifted straight from the Spark wordcount example. In fact, this whole demo is really an adaptation of the aforementioned Spark demo, with a slightly different example problem.
In order to set this all up, create a folder to work in. Create a file named build.sbt in the root of your project and paste in the sbt outline from the Spark wordcount example:
With the sbt file out of the way, let’s create the proper spot for the source file: mkdir -p src/main/scala/al/archetyp/biowordcount. The source file BioWordCountSpark.scala belongs in the biowordcount directory.
From the root directory of the project, run sbt package. You should see the source file get compiled, and a jar created under ./target.
The application should now be compiled and packaged up for usage. Let’s look for a tasty VCF to crunch with our new Spark app. I went to 1000genomes.org and dug up a VCF containing only variants from human chromosome 22, the shortest autosomal contig. Once unzipped, the file is 10GB, with over 1,000,000 lines. There are over 2000 samples in this VCF, which means each row has more than 2000 columns. I used the following spark-submit command to run the app on this VCF file:
time /Users/rob/bin/spark/spark-1.0.2-bin-hadoop2/bin/spark-submit \
  --class "BioWordCountSpark" \
  --master local \
  target/scala-2.10/biowordcount-spark_2.10-1.0.jar \
  /Users/rob/Downloads/vcf/ALL.chr22.phase3_shapeit2_mvncall_integrated_v4.20130502.genotypes.vcf
Your paths will vary, especially the location of the spark-submit binary and of the VCF file to be counted. The --master portion of the command directs Spark to run this app locally; note that plain local runs with a single worker thread, while local[4] would use up to 4 cores. If you have a cluster, perhaps on AWS, the --master argument should be replaced with the URL of the Spark master.
Here follows my resulting count for the 1000 Genomes VCF:
When running the app locally on my 2014 MacBook Pro with four cores, it completed in 1 minute and 1 second. The OS X kernel may have cached some of the data, as I did run this command several times, with slight improvements each time. However, for processing a 10GB file with 1 million rows and over 2000 columns, 1 minute is pretty good. Just to see how this approach stacks up against a python script, I decided to solve the same problem in python. I know that the real benefit of Spark comes when you are running on a multinode cluster, but I was sufficiently impressed with this performance that I wanted to get an idea how long it would take python to do something similar. Here’s my python solution:
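That script isn’t reproduced here, but a minimal sketch of the same counting logic might look like this (the demo lines below are made up; a real run would stream the VCF from a path given on the command line):

```python
from collections import Counter

# Stream a VCF line by line, skip header lines, keep single-base REF/ALT
# pairs, and tally "ref -> alt" occurrences.
def count_spectrum(lines):
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        ref, alt = cols[3], cols[4]
        if len(ref) == 1 and len(alt) == 1:
            counts[f"{ref} -> {alt}"] += 1
    return counts

if __name__ == "__main__":
    # Real usage would be count_spectrum(open(sys.argv[1])); this is a demo.
    demo = [
        "#CHROM\tPOS\tID\tREF\tALT",
        "22\t100\t.\tA\tG\t.\tPASS\t.",
        "22\t200\t.\tA\tG\t.\tPASS\t.",
    ]
    for key, n in sorted(count_spectrum(demo).items()):
        print(key, n)
```

A single-threaded loop like this is the baseline the Spark timings below are compared against.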
When running the above python script on the same VCF, the running time was 1 minute and 40 seconds, reliably. So the speedup is not that amazing. I grabbed a larger VCF, this time chromosome 1 from 1000 Genomes, which is 61GB when unzipped. The Spark solution took 5 minutes and 50 seconds, while the python solution took 12 minutes and 5 seconds.
Not only is the Spark code faster to write (once you understand the basics), but it actually runs faster. Turnaround times on test runs are reduced when the app itself runs faster. It is not unreasonable to assume that code which can be written so much faster and in so many fewer lines will be much easier to read and understand. The density of Scala allows the reader to see and understand larger swaths of the code within the space of a glance. I’m looking forward to exploring the Spark ecosystem further.
Many people who do bioinformatics (the field seems to have settled on “bioinformatician”, but I like “bioinformaticist” better) find themselves dealing with large data sets and long running processes, arranged in myriad pipelines. In our time, this inevitably demands distributed computing. Life innovated during the Cambrian explosion by going from single cells to colonies of cells. Life found a way to distribute and parallelize its processes. In order for us to properly focus our analytical microscopes on life, we imitate life in this strategy and distribute our processes across multiple CPUs.
There are many strategies for distributing computation across multiple cores, processors, or physical computers, but this writeup will only cover the solutions provided by the Hadoop ecosystem, as packaged by Cloudera. As it is widely used now, the Hadoop system of distributed computation and storage does not provide the best solution for long running computations that input or output little data. Even tasks which do use lots of data, like sequence alignment, can resist parallelization such that the benefits of Hadoop are negated. Hadoop shines when inputs and/or outputs are large in scale, but made of many independent records. Why is this?
Hadoop is a series of services which, when taken together, provide an environment for distributed computation. Each computer in a Hadoop cluster is referred to as a node. A Hadoop environment contains one or more nodes. The majority of the nodes in a multi-node cluster are known as “DataNodes”, which store the data and do most of the computation. There are also one or more “NameNodes”, which serve to maintain the state of the Hadoop distributed filesystem, but not the contents of any of the files on that filesystem, just the directory structure and the locations of the blocks that make up the files in it.
The Hadoop Ecosystem
HDFS is the conceptual core of Hadoop. This is the part that, once understood, can unlock one’s view of the whole picture. The key concept is that instead of pushing data to where computations are happening, the computation is pushed to where the data lives. Contrast this with Platform LSF or Sun Grid Engine, where the state of the art is to have NFS mounts on your compute nodes, such that when a job lands on a node, it fetches the required inputs over the network and then does computation on them. Hadoop inverts this order of things by breaking each file up into chunks (the default chunk size is 64MB) and distributing these chunks to the DataNodes. The NameNodes track which DataNode has which chunk of which file.
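As a toy illustration (not actual HDFS code), the chunking and placement can be pictured like this; the round-robin assignment and node names are invented for the example, since real HDFS placement also considers replication and rack topology:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # HDFS default block size, 64MB

# Split a file of a given size into fixed-size chunks and assign each
# chunk to a DataNode; remembering this mapping is the NameNode's job.
def chunk_locations(file_size, datanodes):
    n_chunks = -(-file_size // CHUNK_SIZE)  # ceiling division
    return {i: datanodes[i % len(datanodes)] for i in range(n_chunks)}

# A 200MB file becomes 4 chunks spread across three DataNodes.
placement = chunk_locations(200 * 1024 * 1024, ["dn1", "dn2", "dn3"])
```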
In order to do some computation on those distributed chunks, the MapReduce paradigm enters the picture. When computation needs to be done on a particular file, the required executables (jar files in this case) are sent to those DataNodes hosting at least one chunk of the file. The results of these computations are then gathered in HDFS, and potentially grouped by some key value. This is why large files, containing millions of small, homogeneous records are best for this paradigm. Each chunk is processed, line by line, with the results being aggregated for the whole.
From these two components, HDFS and MapReduce, the other services can be inferred or constructed. There is a distributed column-store database called HBase, there is a workflow manager called Oozie, there is an indexer/search service provided by SolrCloud.
It is doubtful that any bioinformatics shop will entirely rely on Hadoop for all its distributed computing needs. It’s more and more plausible every day, with projects like Crossbow, for Bowtie alignment, and Contrail, for assembly. But in order to really learn how Hadoop works, I suggest an example problem.
I have selected the problem of calculating the mutational spectrum, as it is trivial to understand the mechanics of, is somewhat biologically relevant, and mimics exactly the flow of the classic Word Count MapReduce example.
In short, the mutational spectrum is the rate at which different types of mutations occur. Transitions are changes from a pyrimidine base to another pyrimidine, or a purine to another purine. A transversion is the substitution of a purine for a pyrimidine, or vice versa.
My goal will be to generate a list of mutations, A -> C, A -> G, etc, with the associated number of occurrences in a VCF file. The result will be several lines of text output.
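The transition/transversion distinction above can be sketched in a few lines of Python (a hedged sketch for illustration; the function name is my own, not part of the demo code):

```python
# Purines pair with purines in transitions; any purine <-> pyrimidine
# substitution is a transversion.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def mutation_class(ref, alt):
    ref, alt = ref.upper(), alt.upper()
    if {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES:
        return "transition"
    return "transversion"
```

For example, A -> G stays within the purines (a transition), while A -> C crosses between the two classes (a transversion).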
BioWordCount: The MapReduce Job
The task of parsing each line of the VCF will be accomplished by a MapReduce job, which will be fed from an input in HDFS, and output an answer to HDFS as well. Begin by creating an empty maven project. There is an excellent tutorial on how to generate an empty maven project with the hadoop libraries at Cloudera’s website. Follow this, and you will end up with a project ready for our Java files. Begin in the src/main/java/al/archetyp/biowordcount directory, replacing the archetyp.al namespace with one of your choosing (preferably the one you used when creating your project).
The first class we create will be the BioWordCount class, which will hold both our mapper and our reducer class. For a project of any size, the practice of putting the mapper and reducer in the same class that contains our main would be unwise. However, my goal with this demo is to push the complexity down low enough that one can concentrate purely on what MapReduce is and how one might use it.
The four type parameters to the Mapper class that we extend are important. The first type, LongWritable, describes the type of the input key, in this case an offset into the file, a line number. The second type is the value of the input. Text is a Hadoop-specific type for passing around lines of text from a file. The next two types are the key and value types of the output of the mapper. For the purposes of the biowordcount app, the Text key will be something like “A -> C” and the value, the IntWritable, will always be one, as each line that is processed will only yield a single transition or transversion.
Very simply, for each line of our input file, in this case a VCF, the void map(...) function above will be called. The magic of Hadoop is that it doesn’t matter whether that VCF is 100 lines long or 100 million lines long. This class will be distributed to each node where a portion of the file is stored and executed on the lines of the file contained therein.
Each line of the header is skipped by simply returning without writing something to the output context if the line received starts with #. Then we see a helper class I created for this project to contain the parsing of VCF lines. The relevant parts of the VcfLine class are:
The constructor pulls in a row from a VCF file and parses out each column so we can interrogate the variant in our mapper, finding which transition or transversion to report. Returning to the Map class, each single nucleotide variation (SNV) is written to the context. This is where the magic happens. We are writing a key/value pair. In this example, the key is the mutation type, “A -> C” for example, and the value will be one. The value is meant to represent a count of the mutations encountered that match the key.
At this point, each SNV is represented as a key/value pair in flight between the mapper and the reducer. The void reduce() method of the reducer class will receive, for each key, the collection of all values written with that key. For the purpose of counting the number of like mutations, these collections will be iterated over and summed up, providing a total count of each mutation type.
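The map -> shuffle -> reduce flow described above can be simulated in a few lines of Python (a toy model, not Hadoop code; the mutation keys are sample data):

```python
from itertools import groupby

# Map phase output: one ("mutation type", 1) pair per SNV.
mapped = [("A -> C", 1), ("A -> G", 1), ("A -> C", 1)]

# Shuffle phase: Hadoop groups the pairs by key before calling reduce().
shuffled = groupby(sorted(mapped), key=lambda kv: kv[0])

# Reduce phase: sum the values for each key to get per-mutation totals.
totals = {key: sum(v for _, v in group) for key, group in shuffled}
```

In real Hadoop the grouping happens across the network between nodes, but the contract seen by reduce() is the same: one key, all of its values.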
The last line of the void reduce() function is vital; this is where the aggregated results are written out to the output directory. To inspect the complete project with the setup code, the mapper and reducer, along with the VCF parser, have a look at the github repo.
Compile and Run
Once these components are all put together, go to the root of the project folder where the pom.xml file lives. From here, run the command mvn install. This will run for a little while, compiling your java into bytecode and packaging everything into a jar, which should land in the target subdirectory. This jar is what will need to be sent into the Cloudera VM in order to run the job.
The Cloudera VM
In order to encapsulate the complexity of installing the Hadoop components on your dev machine, grab the Cloudera QuickStart VM (CDH 4.6) and download it. I am using VirtualBox, but they also provide VMware and KVM versions of the virtual machine, if those suit your setup better. Cloudera also includes some parts of their own making, like Cloudera Manager, which similarly wraps the complexity of administering the services of Hadoop. So all you have to do in order to start using these services is download the VM and start it up. This step saves hours, if not days, of setup time.
Once you have downloaded the VM and started it, go dig around. It should start up with Firefox open to a page that links you to Cloudera Manager. Go to CM and click on the “services” menu on the toolbar at the top, then select “all services”. It is possible to stop, start, and restart all the services here. For our purposes, you will need to start, in this order: ZooKeeper, HDFS, and MapReduce.
I chose to create a shared directory to transfer data and jars between the host system and the guest Cloudera VM. Once the VM is set up and the shared directory created and used, or the jar and VCF data have been scp’d into the guest machine, it is time to run the compiled jar. Once the jar and source data are on the VM, open a terminal or ssh into the VM. The first step to running this example is to put the source data into HDFS, so that it will be available when the MapReduce job is executed. hadoop fs -mkdir data, then hadoop fs -put data/mydata.vcf data, and hadoop fs -mkdir output should do the trick. Then execute the job by running the command:
At this point, the jar is sent to those nodes that hold chunks of mydata.vcf. Those nodes then run the various components of the MapReduce process and the final result is deposited in /user/cloudera/output. This can be accessed by the command hadoop fs -cat /user/cloudera/output/part-r-00000, which should produce results similar to those shown below.
The sum total of sites is 954600, which matches exactly the number of non-header lines in the input VCF. While this does not prove the correctness of this approach, it certainly suggests that there are no gross errors where data is added or lost. It’s always a good idea to look for ways to check your thinking at each stage of the process. Disproving assumptions can be difficult, but is more likely to result in good work.
I decided to leave in those sites that did not contain variants, X -> ., so that the information would be present in the results. Again, the right course of action always depends upon the context, but I do like to leave in as much data as possible for as far into the process as possible, so that the user can decide whether it is relevant. There are different records for cases where the reference was a capital letter and where the reference was lowercase. This is also information I decided to leave in the results. A lowercase letter represents a reference base which lies in RepeatMasker regions, or regions of low complexity.
Do what thou wilt. As mentioned above, bioinformatics as a field is fairly young, and bioinformatics on Hadoop is even younger. The distributed MapReduce paradigm is a valuable tool. It is not the only tool, but let’s not allow the fact that it doesn’t do everything get in the way of letting Hadoop do the things that it does well. With the development of more frameworks and tools for bioinformatics on Hadoop, more adoption will be possible and the field will grow.
I’m neither a doctor nor a lawyer, so consider my opinions appropriately. I began my career as a software developer at the Genome Institute (TGI) at Washington University in St. Louis. I didn’t set out to get into bioinformatics; it just happened to be the most interesting problem to work on in St. Louis at the time I was getting my degree in computer science. So I set about learning the craft of software development and the field of bioinformatics at the same time.
As many before me have noticed, biology and computers are strangely similar at their most basic level. Both decompose into some basic operations that get repeated over and over, with important results emerging at higher orders from the basic operations. Computer processors can only do simplistic operations on a few pieces of data at a time. Add two numbers. Multiply two numbers. Compare two numbers. Algorithms of enormous complexity can be implemented as a series of much smaller operations. Such is the case with biology. DNA is transcribed into RNA, which gets translated into proteins, which interact with each other in a cascading series of pathways that in the end results in the swarm of cells that is a human being.
So when I began working on the variant calling portion of the cancer sequencing pipeline at TGI, I started to see that not only are these two fields, computers and biology, deeply similar, but that they reinforce and supplement one another. Computers are to biology what the telescope was to astronomy. Both biology and astronomy existed before their defining technologies were invented, but neither would be nearly as valuable to mankind without their paradigm-shift enabling technologies.
I was intimately involved in the process of taking raw genomic data from a sequencing machine and shepherding it through the pipeline of steps required to produce some actionable output. In the case of cancer sequencing, the output is mostly studies on large aggregations of tumor samples to try and discover how varieties of cancer develop and spread. But sometimes the output is meant to be used for clinical sequencing. This is the sequencing of one specific person and their tumor, to determine if the drugs we already have are good candidates for treatment. I was up to my elbows in other people’s DNA. I saw the variants in their DNA that determine their health. It was like reading the source code for a human being. So naturally I became curious about my own DNA.
23andme provides a personal genome service that can determine which variants an individual has at one million pre-selected sites in the genome. All you need to do is spit in a tube and mail it to them. A few weeks later they let you know when the sequencing and analysis is done, and you get to revel in the information about your source code, just like people with access to big labs, like TGI. To be sure, this is not the same as whole genome re-sequencing, where each and every base of the entire individual is determined. That is a bit more costly, but getting cheaper every day. 23andme provides answers to specific questions like “do I have the BRCA mutations which make me much more susceptible to breast cancer?” The answers you get from 23andme are answers to well studied questions.
By this time, I had become more involved in the field of bioinformatics and ended up at a commercial bioinformatics software shop here in St. Louis, Partek Inc. In order to fuel their employees’ interest in genomics and spark their thinking on use cases, they paid for all of us to get sequenced by 23andme. So in the summer of 2012, we each received our results and marveled at the vast amount of detail. Each question answered by sequencing is backed up by a plethora of links to published journal articles describing the function and studies done on that particular variant.
Here is how 23andme describes the state of my genome with respect to the above mentioned BRCA mutations:
No copies of the three early-onset breast and ovarian cancer mutations identifiable by 23andme. May still have a different mutation in BRCA1 or BRCA2.
You have to be familiar with how genetics works in order to properly parse this statement.
Enter the FDA. The FDA has recently ordered 23andme to stop providing their service to consumers. One of the issues they cite is providing BRCA data to consumers:
For instance, if the BRCA-related risk assessment for breast or ovarian cancer reports a false positive, it could lead a patient to undergo prophylactic surgery, chemoprevention, intensive screening, or other morbidity-inducing actions, while a false negative could result in a failure to recognize an actual risk that may exist.
This is an alarming point to raise. No one wants to needlessly cause “morbidity-inducing actions” or give a false sense of security. So let’s unpack both the false positive case and the false negative case.
A false positive would be asserting that the customer has a specific mutation, when they do not in fact have that mutation. What are the odds of this happening? I don’t have specific data on the 23andme lab, but similar technologies result in errors on the order of 2.5 calls in 100,000. This is indeed a small probability. This is in the neighborhood of flipping a coin and getting 16 heads in a row. It’s quite unlikely, but not unthinkable. So if you get 1,000,000 variant calls from 23andme, somewhere in the neighborhood of 25 will be incorrect. There is a much deeper error model built around sequencing and variant calling, but I will only glance off the surface for today’s purpose. The FDA is concerned that “morbidity-inducing actions”, like a mastectomy, could be undertaken based on faulty evidence. However, very few customers of 23andme are going to read their results, find a BRCA positive answer, and immediately perform a self-mastectomy. At some point, the medical establishment gets involved. There are orthogonal methods of validating the BRCA carrier status. 23andme is purely an informational service, meant to be the beginning of a conversation, not the last word.
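To check the arithmetic in that paragraph (a back-of-the-envelope calculation, not data from 23andme):

```python
# Quoted error rate: roughly 2.5 miscalls per 100,000 variant calls.
error_rate = 2.5 / 100_000
calls = 1_000_000
expected_errors = error_rate * calls  # roughly 25 incorrect calls

# Comparable coin-flip odds: 16 heads in a row is about 1.5e-5,
# the same order of magnitude as the 2.5e-5 error rate.
coin = 0.5 ** 16
```

So the "neighborhood of 25" figure and the coin-flip analogy are both consistent with the stated per-call error rate.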
The case of a false negative is on the same order of improbability as the false positive. The snag comes in the semantics of the question being answered. Does the patient have the previously identified, inherited mutations in the BRCA genes? This is like knowing a certain edition of a book has a typo in a specific word on a specific line. Science has identified several of these specific typos that are commonly inherited. 23andme tells you if you have these specific typos in your book of life. It’s also possible that you acquired a typo during your lifetime, in some other word in the BRCA chapter of your genome. Or even that you inherited a typo, but one that hasn’t been identified by science yet. These are answers that 23andme can’t give you. So while they can say you don’t have typos X, Y, or Z, they can’t state that you have no typos at all. 23andme even states this in the copy on their site: “May still have a different mutation in BRCA1 or BRCA2.”
In effect, the FDA wishes to place some barrier, either themselves or some other apparatus of the medical establishment, between your genome sequence and you. They call the service that 23andme provides a “medical device.” I don’t claim that the implications of consumer genome sequencing are clear or entirely knowable at this time. I do claim that if we step in with overreaching caution and control where it is not warranted, we risk destroying a nascent industry that is going to be one of the biggest, most world-shaping technologies of the next 100 years. Companies that now enjoy the freedom to innovate with new products and technologies will be forced to move elsewhere or stop.
St. Louis is becoming an important center for life sciences, with the Genome Institute, Monsanto, the Danforth Plant Science Center, Partek, and even newer players like Cofactor Genomics. The path forward into the 21st century for these technologies does not go through slow, monolithic government bureaucracies. People free to act on their insights and recent innovations will build this industry elsewhere if we clamp down on it now.
I have recently started using Digital Ocean for hosting. I was previously using the free Heroku hosting, which I liked very much. The DNS was a bit tricky to set up, but once it got configured, it was smooth. I was able to get a one second load time for the archetyp.al index. It’s all static files, but I was still happy with Heroku, especially for free. However, I’ve been diving much deeper into web development lately, and I’ve decided I need a larger base of operations on the web, and Digital Ocean was at the top of my short list of candidates (along with DreamHost and Linode). The SSDs on every machine are what finally sold me.
I’ve been working on a Rails project lately, and I need a dev machine in the cloud for testing purposes. So I found myself needing to set up a Rails environment on my new Digital Ocean instance. I naturally went with my friend Chris’s excellent blog post on setting up Rails on Ubuntu 12.10 with Nginx, etc. I decided to go with MySQL instead of Postgres.
This is the first Bruce Sterling talk that I encountered. Now I’m an inveterate Sterlingite, but at the time, 2006, I had only barely crossed intellectual paths with Sterling. I downloaded it with some long-dead, Precambrian cousin of Google Reader that lived in the swamps and estuaries of Windows computers and survived by allowing the user to view RSS feeds on his desktop. Then I pushed it to my equally antediluvian, single-purpose device, an Archos mp3 player. I listened to this talk many times before the Archos died off. I thought this audio was lost to the bit bucket of history, until one day… archive.org. These people are the new Library of Alexandria, with podcasts instead of papyrus.
A month ago I was hanging out with my friend Chris. We were having our weekly meetup, talking about our approach to our work, recent experiences, and just enjoying our surroundings. I noticed a connect four game sitting in the corner, on a dark wooden ledge. The old blue-and-yellow plastic version of the game, with its disjoint pairing of additive and subtractive primary colors, would have done the trick, but this was even better. It was a wooden version of the board. There were two wooden dowels, each half the width of the board, inserted just beneath the bottom row, which held the pieces in the board. When you draw them out, one from each side, the game is reset and the pieces fall to the bottom of the game board.
“I’m sure there’s some simple heuristics to this I’ve long since forgotten.”
Textruder is the next installment in a long line of one-dimensional cellular automata implementations on various platforms and various media. This adventure begins like so many, on the command line. The inspiration for this project came from reading one of Stephen Wolfram’s papers on cellular automata. The original output of the programs testing the concepts of cellular automata was not graphical in the sense of directly mapping each cell to a pixel or block of pixels. Instead, they simply used the command line to emulate this behavior, printing out a new line for each iteration of the row of cells, with an “*” character representing the on cells and a space for the off cells.
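In that same command-line spirit, a minimal elementary cellular automaton can be written in a few lines of Python (a generic sketch, not Textruder itself; the rule number, grid width, and step count are arbitrary choices):

```python
# One step of an elementary CA: each cell's new state is the bit of the
# rule number selected by its 3-cell neighborhood (left*4 + self*2 + right).
def step(cells, rule):
    n = len(cells)
    return [
        (rule >> (cells[(i - 1) % n] * 4 + cells[i] * 2 + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

# Print each generation as a row: "*" for on cells, space for off.
def run(rule=30, width=31, steps=8):
    cells = [0] * width
    cells[width // 2] = 1  # start from a single on cell in the middle
    rows = []
    for _ in range(steps):
        rows.append("".join("*" if c else " " for c in cells))
        cells = step(cells, rule)
    return rows

if __name__ == "__main__":
    print("\n".join(run()))
```

Each printed line is one iteration of the row of cells, exactly the style Wolfram’s early command-line output used.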
I recently decided to re-read a book I had read long ago, in order that I might filter it through the knowledge and experience I’ve accrued since I first read it. It occurred to me that so much of what I have done in the intervening years has an impact on my understanding of it, that I could scarcely hold a conversation with my past self on the topic. The book in question is The Selfish Gene by Richard Dawkins.