Job Description
What’s in it for you:
As a member of a Telemetry SRE team, you will have direct influence on the stability and resilience of Bloomberg systems. You will get to learn and experiment with new technologies, help drive best practices across engineering teams, work with SMEs across various disciplines, and implement changes to improve the developer experience within your own team. Other opportunities include:
- Join a group of dedicated and motivated systems and software engineers working on the backbone of Bloomberg’s Telemetry system
- Learn what it takes, from the application level down to the network level, to maintain highly reliable, scalable, distributed Telemetry ingestion, enrichment, alarming, and visualization
- Manage software and hardware infrastructure that processes billions of data points every day from Bloomberg data centers, client data centers, and public clouds
- Work in a highly autonomous and impact-driven environment
- Get involved in and attend industry conferences, with our encouragement, where you will learn from and contribute to communities that care about observability
We’ll trust you to:
- Understand the current system capacity and load, predict future demand, and make appropriate scaling recommendations
- Define standards and best practices with respect to logging, latency, troubleshooting, and monitoring
- Work with application teams to review and influence the design of software to improve its reliability
- Facilitate continuous integration / continuous deployment to automate deployment and quality control (including functional and capacity testing)
- Investigate and triage production problems as they occur
- Work with application teams deploying software both internally and to the Cloud to ensure proper observability
- Help to create dashboards, monitoring rules, and alerting rules to track the health of the live system
The technologies you’ll use:
- Languages: Python, Ruby, Go, C++
- Platforms: Linux
- Cloud Providers: Google, Microsoft, Amazon
- Infrastructure: Kafka, Kubernetes, ElasticSearch, ScyllaDB
- Telemetry Visualization: Humio, Splunk, Grafana
You’ll need to have:
- 4+ years working with an object-oriented programming language (C/C++, Python, Java, etc.)
- A collaborative and enthusiastic attitude
- A desire to work with high performance, high availability distributed systems
- Curiosity and the ability to dig into systemic software problems, from the application layer down to the network layer
- Experience with Linux
- A degree in Computer Science, Engineering, Mathematics, or a similar field of study, or equivalent work experience
We’d love to see:
- Familiarity with high-performance, high-availability distributed systems
- Experience building infrastructure and tooling to be used by other Engineering teams
- Experience working with telemetry
- Experience working with Google, Microsoft, and Amazon Cloud providers
- Experience with containerization and orchestration technologies (Docker, Kubernetes)
- Working knowledge of Chef, Prometheus, Grafana, Humio, Splunk, ElasticSearch, Kafka
- Experience with continuous integration and deployment tools (Jenkins)
- Deep understanding of TCP/IP and Unix networking
Job ID: 127340