In my last role I worked on a Blockchain-based product that ran in Production. I was part of the project from its early days, from Blockchain performance testing through to production support.
People, especially recruiters, have often asked me whether I have worked with Blockchain in Production and, once I said yes, wanted to know more, just as you do.
I feel fortunate to have had this opportunity, and at the scale of a 5 million active customer base.
Coming to the stack, this is what we were running in Production:
Our user-facing applications were Spring Cloud microservices behind a Zuul Gateway, with Eureka for service registration. We ran all the microservices in HA (High Availability) mode, with every instance registered at the Registry service.
We also had a document storage and retrieval service that read and wrote user-uploaded documents. Since all services ran in HA, we needed a way to keep documents in sync (an upload through the Doc 1 service should be downloadable from the Doc 2 service), so we ran them on a shared LVM storage (similar to Network Attached Storage).
We ran Production and Production DR in active-passive mode, with a plan to switch over to Production DR if required. That left us in a soup again: if Production went down, the document microservices in Production DR would have none of the documents stored in Production. We solved this by running a synchronization service between Production and Production DR with Unison File Synchronizer (upenn.edu). As a bit of background, this tool can synchronize 2 folders (specified by name). One drawback, in our setup, was that it did not run on a schedule on its own. To get around this we used a cron job on RHEL (with crontab) to run the same command every 10 minutes, which meant the copy in Production DR was at most about 10 minutes old.
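A minimal sketch of what such a cron-driven sync could look like, assuming the job calls a small Python wrapper rather than Unison directly. The paths and the `dr-host` name are hypothetical, and a crontab schedule of `*/10 * * * *` would give the 10 minute interval. `-batch` is Unison's non-interactive mode, which a job without a terminal needs.

```python
import subprocess
from typing import List


def unison_command(local_dir: str, remote_root: str) -> List[str]:
    """Build the Unison invocation for one sync pass."""
    # -batch makes Unison ask no questions, which a cron job requires.
    return ["unison", local_dir, remote_root, "-batch"]


def sync_once(local_dir: str, remote_root: str) -> int:
    """Run one sync pass and return Unison's exit code (0 means success)."""
    return subprocess.run(unison_command(local_dir, remote_root)).returncode
```

A cron entry would then call something like `sync_once("/data/docs", "ssh://dr-host//data/docs")`, where both arguments are placeholders for the real roots.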
The database was hosted on Oracle in a master-slave cluster configuration in Production, with active replication to Production DR.
We also had an in-memory store for loading trivial user details: 5 Redis masters with 1 slave each (as recommended by Redis), in both Production and Production DR.
Our architecture required partners to run Blockchain nodes too, but since many organizations are still hesitant about adopting Blockchain, our partners were skeptical as well. So when we went live in 2020, we ran their share of the Blockchain nodes for a short while, until they gained confidence and took over that responsibility. We ran Quorum nodes, primarily because we needed a private, permissioned Blockchain, the kind of network (like Hyperledger Fabric, Corda and others) that enterprises prefer for various reasons which you can easily Google.
Why Quorum? We were exploring Quorum (back then it was owned by JPMC) and ran a couple of Blockchain performance tests, which yielded around 100 TPS with RAFT consensus. We needed just RAFT consensus and not Tessera, since the use case did not involve private transactions at the time. Quorum also kept that door open in case we ever needed private transactions.
In the whole stack we ran 12 Quorum nodes each in Production and Production DR, all of them active. A passive switchover to Production DR, with Quorum started on the spot from a backup datadir, might have caused problems, and we did not want to take any risks (at least in Production).
Our microservices (those that needed the Blockchain for set and get operations) connected to the 12 hosted nodes in round-robin fashion. This logic was homegrown, with no abstraction layer in between. We connected to the Blockchain nodes with Web3J, which was mature and supported everything we needed.
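To illustrate the idea, here is a minimal round-robin selector. It is a sketch in Python, not our actual Java/Web3J code, and the node URLs are hypothetical. Each call hands out the next node in a fixed rotation, and a lock keeps the rotation consistent when several request threads ask at once.

```python
import itertools
import threading
from typing import Iterator, List


class RoundRobinNodes:
    """Hands out node URLs in a fixed rotating order."""

    def __init__(self, urls: List[str]) -> None:
        if not urls:
            raise ValueError("at least one node URL is required")
        self._cycle: Iterator[str] = itertools.cycle(list(urls))
        self._lock = threading.Lock()

    def next_url(self) -> str:
        # The lock keeps the rotation consistent across concurrent callers.
        with self._lock:
            return next(self._cycle)


# Hypothetical addresses; the real deployment had 12 nodes per site.
nodes = RoundRobinNodes([f"http://quorum-node-{i}:22000" for i in range(1, 4)])
```

A caller would ask for `nodes.next_url()` before each set or get operation and open its client connection against that URL. A selector this simple does not skip unhealthy nodes, so a production version would likely need a health check on top.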
Continuous Integration/Continuous Delivery
This was another part of the puzzle where I worked intensively. Our stack included Git (hosted locally, since our client was wary of the Cloud), Jenkins, where all build pipelines lived, and Ansible, which handled deployments from Dev to Production with just a click in Jenkins.
No, we did not have an automated deployment every time someone pushed code to Git. We used scripts to build the code with Maven and deployed it to the required environment with Jenkins and Ansible.
Ansible Playbooks played a major role, since they carried multiple tasks: backing up the old deployable (in case we needed to revert the deployment), copying the new deployable from the source VM where Jenkins had built the JARs to the destination VM, and testing that the deployed microservice was healthy (with its /health API). Since the whole stack was Spring Cloud, the Registry service had to be deployed first, followed by the Configuration service, with the Zuul Gateway last. The Ansible scripts made sure the Registry service was running before any microservice was started. If anything failed, we stopped and reverted all deployables to what had been in the environment. Building such a script (modularizing it and adapting it to environment-specific quirks, be it VM challenges or SSH) took time to get right and make ready for Production.
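The control flow of that playbook can be sketched roughly as below. This is Python rather than Ansible, and the `deploy`, `is_healthy` and `rollback` callables are hypothetical stand-ins for the real tasks (copy the JAR, call /health, restore the backup). The point is only the ordering and the all-or-nothing revert.

```python
from typing import Callable, List


def deploy_in_order(
    services: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy services one by one; undo everything if one fails its health check."""
    deployed: List[str] = []
    for name in services:
        deploy(name)
        deployed.append(name)
        if not is_healthy(name):
            # Restore the backed-up deployables, newest first.
            for done in reversed(deployed):
                rollback(done)
            return False
    return True
```

Called with a list such as `["registry", "config", "doc-service", "gateway"]`, the registry is always up before anything that depends on it, and a single failed health check restores every service touched so far.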
Ansible Vault was also a major player in the process, as it stored all the secrets needed to access the environments (we had password-based login to our VMs).
Ansible also played a major role in setting up the Blockchain nodes in every network (from preparing the folders and building the binary, to deploying each node on its specific VM, starting it and peering it with the other nodes). It also helped set up the Redis clusters with appropriate scripts.
I was a newbie to the whole CI/CD stack when I signed up for this role and struggled a lot (my team specialized in software development and Blockchain, and I was the only explorer out there). Making an adaptable script that covered the whole deployment cycle in every environment was really no piece of cake.
Monitoring and Reporting
This is another piece of the stack that every project needs, at least in production. We were fairly new to it as well, and I was the major contributor to the setup and maintenance. After research and several rounds of discussion, we decided on ELK (Elasticsearch, Logstash, Kibana) with APM, Zipkin, shell scripts and process monitors.
ELK with APM, as many know, is a very versatile toolset: it helps not only with log monitoring but also with preparing charts (with dynamic data), and it is well developed for reporting purposes. Our choice turned out to be a very good one (which we realized later in production :) ).
Our log monitoring stack had Filebeat running on every VM where our Spring Cloud services ran, reading the (rolling) log files they generated. Filebeat passed the logs to our Logstash instance for further processing (using the GROK pattern we defined for our application logs). From there they went to our Elasticsearch cluster (where we had minimal configuration), which indexed them. Kibana was connected to Elasticsearch and displayed the log indices it prepared. We also had APM, which monitored the CPU and memory statistics of every running microservice and wrote them to an index in Elasticsearch, again displayed in Kibana.
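What a GROK pattern does in Logstash is essentially named-group regex matching: one raw log line goes in, a set of named fields comes out. The sketch below shows the same idea in Python. It assumes the default Spring Boot console log layout, which may not be the format we actually used, and the field names are my own choice.

```python
import re
from typing import Dict, Optional

# Assumed layout: "<timestamp>  <LEVEL> <pid> --- [<thread>] <logger> : <message>"
LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<pid>\d+)\s+---\s+"
    r"\[\s*(?P<thread>[^\]]+?)\s*\]\s+"
    r"(?P<logger>\S+)\s*:\s+"
    r"(?P<message>.*)$"
)


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Split one log line into named fields, or return None if it does not match."""
    match = LINE.match(line)
    return match.groupdict() if match else None
```

Lines that do not match (stack trace continuation lines, for example) come back as `None`, which is roughly where Logstash would tag an event with a parse failure instead.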
One major thing that is easy to miss in a production system is that Elasticsearch uses a lot of space for indexed data, so old logs need to be either backed up and kept for recovery, or deleted. Our network team helped us back up application logs to cold storage, so we had less to worry about if a node in Elasticsearch failed or we faced a storage issue. We kept only 30 days of live data in Kibana for Logstash and 15 days for APM (as it generated more data). I configured a cleanup policy (with curl) for the Logstash and APM data, which kept data only for X days back from the current date.
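The retention rule itself is simple date arithmetic. The sketch below is not the policy I configured with curl; it only illustrates the "keep X days" logic, assuming daily indices named with a prefix and a `YYYY.MM.dd` suffix (for example `logstash-2021.03.01`). The function name and prefixes are hypothetical.

```python
from datetime import date, datetime, timedelta
from typing import Iterable, List


def expired_indices(
    names: Iterable[str], prefix: str, keep_days: int, today: date
) -> List[str]:
    """Return the daily indices under `prefix` that are older than `keep_days`."""
    cutoff = today - timedelta(days=keep_days)
    expired: List[str] = []
    for name in names:
        if not name.startswith(prefix):
            continue
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # Not a daily index; leave it alone.
        if day < cutoff:
            expired.append(name)
    return expired
```

With `keep_days=30` for the Logstash indices and `keep_days=15` for APM, the returned names are the ones a cleanup job would be allowed to delete.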
Process and service monitoring was varied as well. We monitored the Java and Quorum processes with Supervisord (which, for the uninitiated, is a widely used process monitor that also restarts a process in case of failure). For monitoring the database we used DB scripts maintained by our DBA.
Supervisord played a very special role, as one of our Java services was connected to an HSM and behaved erratically in production. I configured the scripts to restart the microservice on any failure. The same went for Quorum, since we couldn't take any risks in Production.
Another part of process monitoring was sending emails on any failure in any part of the system. I used an SMTP server from our network team and prepared templated emails for service unavailability, which went out to the team members responsible.
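A templated alert of this kind can be put together with the Python standard library alone. This is a sketch with a made-up template, subject line and addresses, not the emails we actually sent.

```python
from email.message import EmailMessage
from string import Template
from typing import List

# Hypothetical template; the real wording was different.
BODY = Template(
    "Service $service on host $host is unavailable as of $when.\n"
    "Please check the process monitor and the service logs."
)


def build_alert(
    service: str, host: str, when: str, sender: str, recipients: List[str]
) -> EmailMessage:
    """Fill the template and wrap it in a ready-to-send email message."""
    msg = EmailMessage()
    msg["Subject"] = f"[ALERT] {service} down on {host}"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(BODY.substitute(service=service, host=host, when=when))
    return msg
```

Sending is then a matter of handing the message to `smtplib.SMTP(<server>).send_message(msg)`, with the server address supplied by whoever runs the SMTP relay.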
Quorum Blockchain monitoring and learnings
The solution was to stop this node, plug in another node (with an increased RAFT node number), and start that node up with fast sync on and the raftjoinexisting flag, connecting it to the other peers. This worked because we had done our research before going to production: we had purposely failed a node and tested exactly this.
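For readers who want a feel for the moving parts, here is a rough sketch of the two pieces involved, written from memory of the GoQuorum RAFT API rather than copied from our scripts. An existing cluster member is asked to add the new peer through the `raft_addPeer` JSON-RPC method, which returns the new RAFT ID, and the new node is then started with `--raftjoinexisting` set to that ID. The enode URL, datadir and helper names are hypothetical, and a real start command carries many more flags.

```python
import json
from typing import List


def add_peer_request(enode_url: str, request_id: int = 1) -> str:
    """JSON-RPC body asking an existing member to admit a new RAFT peer."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "method": "raft_addPeer",
            "params": [enode_url],
            "id": request_id,
        }
    )


def join_existing_args(datadir: str, raft_id: int) -> List[str]:
    """Minimal start arguments for a node joining an existing RAFT cluster."""
    # raft_id must be the value returned by raft_addPeer for this node.
    return ["geth", "--datadir", datadir, "--raft", "--raftjoinexisting", str(raft_id)]
```

The request body would be posted to the RPC endpoint of any healthy node, and the ID in its response is what goes into `join_existing_args`.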
Readers might be wondering why I did not post any architecture diagram, commands or tutorials that would help them set up a similar architecture. The primary reason is that this article is a sort of mind book, which I wanted to share to show how a project (specifically a full-blown Blockchain-powered project) can be executed in production. I will surely post articles and tutorials in the future for curious readers, and add links that will help them set up something similar.
I request you to please comment, clap (if you like this) and help me grow along with the open source community.