Pintu is offering an opportunity for a full-time Site Reliability Engineer to join our Exchange SRE Team. The individual in this role will gain experience running complex, geographically distributed Cloud setups that serve a large number of client connections, both ad-hoc and streaming.
This position requires outstanding technical proficiency, professionalism, solid communication, exceptional problem-solving skills, and an eager attitude.
The successful candidate will play a key role in building, operating, and evolving an error-free, low-latency, high-capacity, high-throughput next-gen Crypto Exchange, its matching engines, and the back-end software systems that serve millions of customers (retail or institutional investors, B2B2C clients, market makers, etc.).
The ideal candidate should be knowledgeable in the trading technologies domain, infrastructure-as-code concepts, various orchestration engines and containerization technologies, and monitoring engines and stacks, and should be familiar with high-performance computing and networking.
Strong written and oral communication is a must, as the applicant will frequently be interacting with the business stakeholders and product teams to achieve Pintu's strategic business goals.
Essential Functions / Responsibilities
Analyze Business/Product requirements and propose effective and efficient technical solutions in delivering changes and innovations to the Pintu Exchange infrastructure and landscape
Work with a project focus group (product engineering, product management, architecture, and CTO) to compile a work breakdown structure of tasks for given deliverables and provide realistic estimates for completion or project assignments
Design, build, maintain, and improve Pintu’s Exchange infrastructure and respective tooling. Ensure infrastructure elasticity and automated scalability for cost-efficient resource utilization while ensuring the system’s high availability and fault tolerance
Collaborate with other Developers, SREs, and QA Engineers to execute full-cycle integration, functional, and regression testing. Own and resolve all priority defects identified within the solution codebase efficiently and in a timely fashion
Promote software changes across all environments safely and responsibly, from Development through Staging to Production, deploying updates to the Production environment in a zero-downtime manner
Provide effective infrastructure Level 1 technical support during business hours and, occasionally, off hours depending on a rotation schedule. Design, build, maintain, and improve the respective infrastructure monitoring tooling that is critical both for in-the-moment situational awareness and proactive incident response and for future infrastructure capacity planning activities
Participate in team exercises to identify and implement areas for continuous improvement, and be proactive in putting your ideas forward
Educate and mentor your engineering colleagues in the areas of your own expertise and domain knowledge, and be open-minded and approachable
Experience Required
5+ years of SRE experience, ideally working with Amazon Web Services and Google Cloud environments, or MS Azure
Experience in designing and implementing AWS and/or GCP setup from scratch
Experience building and running cross-regional resilient solutions
Experience in architecting, building, deploying, and operating enterprise-ready container solutions on Kubernetes
Solid experience in setting up and maintaining message broker infrastructure (Kafka, RocketMQ, etc.)
Experience in setting up a Cloud persistence layer (AWS Aurora, GCP BigQuery, etc.)
Experience implementing a large service mesh via Istio or any other relevant solution
Experience building on-demand, short-lived environments (for debugging, profiling, and load-testing scenarios)
Experience working in small focus teams of highly skilled engineers
Necessary Skills
Solid understanding of Cloud networking concepts (VPC, peering, interconnects, etc.)
Good understanding of Cloud Security principles (VPN, Application Firewall(s), IAM, etc.)
Experience with operating systems, especially good knowledge of the Linux operating system and understanding of network architectures
Deep knowledge of Docker and Kubernetes
Solid knowledge of Bash, Ansible, and Terraform scripting
Well-versed in using SDLC CI/CD pipelines for automated infrastructure management of large-scale system deployments
Excellent written and verbal communication skills
An energetic, creative, and autonomous self-starter
Preferred/Bonus Skills
Knowledge of Makefiles
Knowledge of Python and the respective libraries
Hands-on experience working with the Cloudflare Enterprise stack
Knowledge of TCP/IP and UDP networking protocols
Experience in infrastructure performance and chaos testing
Experience working with GitHub Actions
Hands-on experience and knowledge of Hardware Security Modules (HSM) or hardware enclave solutions
Experience in financial technology, with crypto and/or traditional financial know-how a strong plus