Someone once said, “Life is better at the lake.” That may be true in the physical world, but when it comes to using data lakes for observability, that’s not typically the case.
Just about every organization in every industry has a digital presence, whether it’s a website, an app or another digital service. This makes observability a universal need; it helps teams monitor and optimize software performance, detect potential security issues and quickly find and fix bugs when they occur. These responsibilities typically fall on engineering teams and require as close to real-time response as possible.
Some organizations also have business intelligence (BI) needs, aimed at generating insights from multiple data sources for business decision-making. These responsibilities typically fall on business decision-makers; their decisions are not made in real time, but rather by looking back over long time periods.
All-In on a Data Lake Strategy
Companies that go all-in on a data lake strategy may ignore the needs of the engineering teams working on the real-time mission-critical aspects of the business and force all of the company’s data, queries, dashboards, etc. into a data lake solution. That doesn’t actually meet the needs of the engineering teams. The result? Engineering teams use unsanctioned solutions, fracturing the company’s data strategy from the ground up.
By using a standards-based observability solution built on open source InfluxDB 3.0, companies can give their engineers the observability tools they need while preserving their data lake strategy.
Here are three main challenges that teams come across when using data lakes for time-series data, and how InfluxDB 3.0 addresses each of them.
Problem One: Inefficient and Slow Ingest
Data sources generate time-series data at an incredibly high rate, easily producing hundreds or thousands of data points every second; some sources even do so as quickly as every nanosecond. Data lakes can store massive volumes of data, but they aren’t optimized to ingest that data quickly and make it available to read on a millisecond time scale.
Data lake ingest jobs can be measured in minutes or even hours, which is far too late for any meaningful interventions that can prevent catastrophes such as downtime. Customers will notice the outage before the data is even ingested into the data lake, much less acted upon.
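To put rough numbers on that delay, here is a minimal back-of-the-envelope sketch. It assumes a simple batch model in which a point that just misses one batch waits a full interval for the next one and then waits for the job to finish. The function names and the sample figures (a 15-minute interval, a 5-minute job, 1,000 points per second) are illustrative assumptions, not measurements of any particular data lake.

```python
def worst_case_staleness_s(batch_interval_s: float, job_duration_s: float) -> float:
    """Seconds before a point that just missed a batch becomes readable.

    Assumed model: the point waits one full interval for the next batch
    to start, then waits for that ingest job to complete.
    """
    return batch_interval_s + job_duration_s


def points_not_yet_readable(points_per_s: float, staleness_s: float) -> float:
    """How many points pile up, unreadable, during the staleness window."""
    return points_per_s * staleness_s


# Illustrative figures only (assumptions, not benchmarks).
staleness = worst_case_staleness_s(batch_interval_s=15 * 60, job_duration_s=5 * 60)
backlog = points_not_yet_readable(points_per_s=1_000, staleness_s=staleness)
print(f"worst-case staleness: {staleness / 60:.0f} min, backlog: {backlog:,} points")
```

Under these assumed figures, an engineer could be looking at data that is 20 minutes old while more than a million newer points are still waiting to be ingested.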
Problem Two: Time-Series Life Cycle
Full-fidelity data is necessary for real-time analytics and observability, but the value of the full-fidelity data fades over time. Storing the full-fidelity data in a data lake may be convenient, but it is costly and not necessary for typical BI purposes.
InfluxDB is designed for the time-series data life cycle. This life cycle entails:
1. The leading edge of data. That is, data written in the last two hours or so. This is the data that is most likely to be queried, so the full-fidelity data is kept in memory, ready to query instantly.
2. Compacted data. That is, data older than the leading edge that is still likely to be queried and still requires full fidelity. This data is compacted to disk in an object store, which still permits fast querying but does not incur the cost of keeping data in memory.
3. Retention policies. These apply to data that has lost its value in its full-fidelity form. For example, observability data older than a few weeks can be aggregated into a much smaller data footprint. This aggregated data is then ingested into the company’s data lake while the full-fidelity data is automatically dropped from InfluxDB 3.0.
Without such a system, the full-fidelity data must be stored in the data lake, incurring those massive costs without a reasonable way to manage the life cycle of the data. With such a system, only the amount of observability data that your BI needs actually require is ingested into your data lake.
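The three life cycle stages can be sketched in a few lines of plain Python. This is a conceptual illustration only, not InfluxDB's implementation or API: the two-hour leading edge comes from the description above, while the three-week retention window, the tier names and the hourly-mean aggregation are assumptions chosen for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import mean

LEADING_EDGE = timedelta(hours=2)             # "last two hours or so"
FULL_FIDELITY_RETENTION = timedelta(weeks=3)  # assumption for "a few weeks"


def tier_for(ts: datetime, now: datetime) -> str:
    """Classify a point by age into one of the three life cycle stages."""
    age = now - ts
    if age <= LEADING_EDGE:
        return "memory"              # leading edge, full fidelity
    if age <= FULL_FIDELITY_RETENTION:
        return "object_store"        # compacted, still full fidelity
    return "aggregate_and_drop"      # downsample, then drop the raw points


def downsample_hourly(points):
    """Aggregate (timestamp, value) pairs into one mean value per hour."""
    buckets = defaultdict(list)
    for ts, value in points:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


now = datetime(2024, 1, 31, 12, 0, tzinfo=timezone.utc)
raw = [
    (datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc), 1.0),
    (datetime(2024, 1, 1, 10, 45, tzinfo=timezone.utc), 3.0),
    (datetime(2024, 1, 1, 11, 10, tzinfo=timezone.utc), 10.0),
]
# Only points past the retention window are aggregated for the data lake.
expired = [p for p in raw if tier_for(p[0], now) == "aggregate_and_drop"]
for_data_lake = downsample_hourly(expired)
```

In this sketch, three raw points collapse into two hourly rows before they would be handed to the data lake, which is the footprint reduction the retention stage is meant to deliver.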
Problem Three: Slower Query Performance
Engineers get deeply frustrated when they are troubleshooting an issue in real time, but their dashboards won’t load or their queries won’t return in a reasonable amount of time. Data lakes simply are not designed to produce fast results for observability, so query performance is a primary complaint of engineers trying to use a data lake for that purpose. Waiting minutes for a query to return may be annoying when you are running a BI report, but it could spell extended downtime for an engineering team.
Fortunately, InfluxDB is built on the Apache Arrow standards, which means that your engineering team can query it directly. Via protocols such as Delta Sharing and Iceberg, anyone can query InfluxDB 3.0 from their data lake and join the real-time observability data stored in InfluxDB 3.0 with the data lake data. This is akin to having your cake and eating it too.
The Bottom Line
If you are trying to use a data lake strategy to fulfill your engineering team’s observability needs, you are going to frustrate those teams because the data lake is not designed for their ingestion, data life cycle or query needs. This will cause those teams to create rogue solutions that are inconsistent with your company’s data lake strategy.
Providing those teams with open source InfluxDB 3.0 gives them a first-class observability platform that is designed to interoperate sensibly with your company’s data lake. This supports the critical mission of your observability platform while maintaining your overall data lake strategy for your BI needs.