DataStreamer Performance FAQs
This document answers common questions about DataStreamer's performance, resource requirements, and how to handle different data workloads effectively.
Q1: How do I size my DataStreamer deployment? What are the key factors?
Sizing a DataStreamer deployment is based on a throughput-driven methodology, not arbitrary component counts. The two primary resources to plan for are CPU and Memory, and they are influenced by four key factors:
Throughput (EPS & Event Size): The total volume of data you process, measured in Events Per Second (EPS) and the average size of those events.
Data Structure: Whether your data is structured (like JSON) or unstructured (plain text) has the single biggest impact on CPU usage.
Transformation Complexity: The number and type of transformations (parsing, filtering, enrichment) you apply to your data.
High Availability (HA) Needs: The level of redundancy required, which influences the minimum number of instances.
Our official Sizing & Capacity Planning Guide provides detailed formulas and step-by-step calculations to help you determine your exact needs.
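The four factors above can be combined into a rough estimate. This is an illustrative sketch only: the coefficients (events per vCPU, the structure penalty, the fixed memory base) are hypothetical placeholders, not the formulas from the official Sizing & Capacity Planning Guide.

```python
# Illustrative sizing sketch. All coefficients are assumed placeholder values,
# not the official guide's formulas.

def estimate_vcpu(eps: int, unstructured_share: float, complexity: float) -> float:
    """Rough vCPU estimate from throughput, data-structure mix, and
    transformation complexity (1.0 = simple filter, 3.0 = heavy parsing)."""
    base = eps / 1500                          # assume ~1,500 EPS per vCPU baseline
    structure_factor = 1 + unstructured_share  # fully unstructured data doubles CPU
    return base * structure_factor * complexity * 1.3  # 30% CPU headroom

def estimate_memory_gb(eps: int, avg_event_kb: float) -> float:
    """Rough memory estimate: a fixed base plus a 100 ms in-memory buffer."""
    throughput_mib_s = eps * avg_event_kb / 1024
    return (2.0 + throughput_mib_s * 0.1 / 1024) * 1.4  # 40% memory headroom
```

Use the official guide's formulas for real planning; the sketch only shows how the factors interact multiplicatively.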
Q2: Why aren't all logs created equal? How does data structure affect performance?
Not all logs are equal because the work required to process them varies dramatically. The primary difference lies in structured vs. unstructured data.
Structured Logs (e.g., JSON) are machine-readable out of the box. DataStreamer can parse them with minimal CPU overhead because the fields and values are already defined. This is a highly efficient, low-cost operation.
Unstructured Logs (e.g., plain text, syslog) require CPU-intensive parsing to extract meaningful fields. This often involves complex regular expressions (Regex) or Grok patterns that must be executed on every single log line. This parsing work is the most computationally expensive part of a data pipeline.
This table, based on a 6,000 EPS workload, shows how CPU requirements decrease as the percentage of structured data increases:
| Scenario | Unstructured / Structured | CPU | Memory |
| --- | --- | --- | --- |
| 1 | 100% / 0% | 8 vCPU | 8 GB |
| 2 | 70% / 30% | 6 vCPU | 8 GB |
| 3 | 50% / 50% | 5 vCPU | 8 GB |
| 4 | 10% / 90% | 4 vCPU | 8 GB |
As you can see, moving from a fully unstructured to a mostly structured workload can reduce your CPU needs by 50%, while memory remains stable.
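For workloads that fall between the table's scenarios, linear interpolation gives a reasonable first estimate. A minimal sketch, assuming the 6,000 EPS table above (the helper name is hypothetical):

```python
# Interpolate vCPU need for a 6,000 EPS workload from the scenario table:
# (unstructured share, vCPU). Memory stays flat at 8 GB in all scenarios.
SCENARIOS = [(0.10, 4), (0.50, 5), (0.70, 6), (1.00, 8)]

def vcpu_for_mix(unstructured: float) -> float:
    """Linearly interpolate vCPU between the table's anchor points."""
    if unstructured <= SCENARIOS[0][0]:
        return SCENARIOS[0][1]
    for (x0, y0), (x1, y1) in zip(SCENARIOS, SCENARIOS[1:]):
        if unstructured <= x1:
            return y0 + (y1 - y0) * (unstructured - x0) / (x1 - x0)
    return SCENARIOS[-1][1]
```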
Q3: You say DataStreamer is CPU-intensive. Why is that?
DataStreamer itself is incredibly lightweight. The "intensive" part is the data transformation work it performs. Ingestion and routing are cheap, but parsing, filtering, and enriching data are computationally expensive by nature.
CPU-Intensive Tasks: Regex parsing, GeoIP lookups, complex filtering logic, and data enrichment.
DataStreamer is considered CPU-intensive because it is so efficient at its core that the primary performance bottleneck is almost always the complexity of the work you ask it to do, not the tool itself. This is a key difference from other tools where the runtime (like the JVM) consumes a significant portion of the CPU before any work is even done.
Q4: Does each pipeline I create require its own dedicated CPU core?
No, this is a common misconception. DataStreamer is built on a modern, asynchronous runtime (Tokio) that is far more efficient than a "thread-per-pipeline" model.
Shared Thread Pool: DataStreamer creates a small pool of worker threads, typically equal to the number of CPU cores available on the machine.
Asynchronous Tasks: Each pipeline, along with its sources, transforms, and sinks, runs as a lightweight asynchronous task.
Multiplexing: The runtime efficiently schedules (multiplexes) hundreds or thousands of these async tasks onto the shared thread pool.
A task only uses a thread when it has actual work to do. If a pipeline is waiting for data from a network socket, it yields control of the thread, allowing another pipeline to perform its work. This results in extremely high CPU utilization and efficiency.
Q5: If pipelines don't consume a CPU core each, why does the number of pipelines matter?
The number of pipelines is important for logical organization, state management, and deployment strategy, not for raw CPU allocation.
Memory Consumption: Each pipeline maintains its own internal buffers and state. Therefore, more pipelines will consume more memory, even if they are processing little data.
State Management: Each pipeline represents an independent data flow. More pipelines mean more states for DataStreamer to manage, which adds a small amount of overhead.
Isolation & Blast Radius: Running different data flows in separate pipelines ensures that an error or backpressure in one pipeline (e.g., a failing output) does not affect the others.
Deployment Strategy: When deploying in Kubernetes, you might limit the number of pipelines per pod (e.g., max_pipelines_per_pod: 2) to ensure a good balance between resource consolidation and fault isolation.
Think of pipelines as a way to organize your work, not as a unit of performance.
Q6: Why would two pipelines with the same EPS have different resource needs?
This is the crucial takeaway: throughput (EPS) is only one part of the equation. The complexity of the work performed within the pipeline is what truly determines the resource requirements.
Consider two pipelines, both handling 1,000 EPS:
Pipeline A (Light Workload):
Receives pre-structured JSON data.
Applies a simple filter to drop events with a certain field.
Forwards the data to S3.
Resource Needs: Very low CPU, moderate memory.
Pipeline B (Heavy Workload):
Receives unstructured syslog data.
Applies 5 complex Regex patterns to parse out dozens of fields.
Performs a GeoIP lookup on the source IP address.
Adds 3 new fields based on the parsed data.
Forwards the data to Splunk and Kafka.
Resource Needs: Very high CPU, moderate-to-high memory.
Even with the same EPS, Pipeline B could require 5-10x more CPU than Pipeline A because the work it is doing is fundamentally more complex. This is why our sizing methodology focuses on both throughput and transformation complexity.
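A back-of-the-envelope cost model makes the gap concrete. The per-event costs below (in microseconds) are assumed illustrative figures, not measured values for DataStreamer:

```python
# Assumed per-event processing costs in microseconds (illustrative only).
COST_US = {"json_parse": 2, "filter": 1, "route": 2,
           "regex_parse": 6, "geoip": 10, "add_field": 1}

def cpu_cores(eps: int, steps: list[str]) -> float:
    """Fraction of one CPU core consumed at the given EPS."""
    per_event_us = sum(COST_US[s] for s in steps)
    return eps * per_event_us / 1_000_000

# Pipeline A: JSON in, one filter, one output.
pipeline_a = cpu_cores(1000, ["json_parse", "filter", "route"])
# Pipeline B: 5 regexes, a GeoIP lookup, 3 added fields, two outputs.
pipeline_b = cpu_cores(1000, ["regex_parse"] * 5 + ["geoip"]
                             + ["add_field"] * 3 + ["route"] * 2)
```

With these assumed costs, Pipeline B lands around 9x the CPU of Pipeline A at identical EPS, which is why complexity, not throughput alone, drives sizing.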
Q7: How does DataStreamer handle sudden load spikes? Will it drop data?
DataStreamer is designed with backpressure management and buffering to handle load spikes gracefully. Here's how it works:
Internal Buffers: DataStreamer maintains in-memory buffers between each stage of the pipeline (source → transform → sink). When a downstream component (like a sink) is slower than the upstream components, the buffer absorbs the difference.
Backpressure Propagation: If the buffer fills up, DataStreamer applies backpressure upstream. For example, if your sink to Splunk is slow, DataStreamer will slow down reading from the source (e.g., a socket) to prevent overwhelming the system.
Disk Buffers (Optional): For mission-critical data, you can configure disk-based buffers. If in-memory buffers fill up, DataStreamer will write data to disk, ensuring no data loss even during extended outages of downstream systems.
Graceful Degradation: If configured with appropriate buffer sizes, DataStreamer can handle spikes that are several times your normal throughput for short periods (seconds to minutes).
Best Practice: Size your buffers based on the expected duration and magnitude of load spikes. Our sizing guide includes a baseline formula for a 100 ms buffer: Buffer Memory (MiB) = Throughput (MiB/s) × 0.1.
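The same rule generalizes to any target duration: memory needed equals throughput multiplied by the time window you want to absorb. A minimal sketch (the helper name is hypothetical):

```python
# Buffer sizing rule: memory needed to absorb `seconds` worth of data
# at a given throughput. Helper name is illustrative, not a DataStreamer API.
def buffer_mib(throughput_mib_s: float, seconds: float) -> float:
    return throughput_mib_s * seconds

baseline = buffer_mib(25, 0.1)      # 100 ms baseline at 25 MiB/s
one_min_spike = buffer_mib(25, 60)  # absorb a one-minute downstream outage
```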
Q8: What happens if I under-provision resources? Will DataStreamer crash?
DataStreamer will not crash, but you will experience performance degradation. Here's what happens:
CPU Starvation: If you don't provide enough CPU, DataStreamer will simply process data more slowly. Your throughput will be lower than expected, and latency will increase. You might see buffers filling up and backpressure being applied.
Memory Pressure: If you don't provide enough memory, the operating system will start swapping to disk, which is extremely slow. This will cause severe performance degradation. In extreme cases, the OS might kill the DataStreamer process (OOM - Out of Memory).
Increased Latency: Under-provisioned systems will have higher end-to-end latency as data waits in buffers or for CPU time.
Recommendation: Always apply the safety factors we recommend (30% CPU headroom, 40% memory headroom) to account for unexpected load and ensure smooth operation.
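Applying those safety factors is a one-line calculation; a sketch with an illustrative helper name:

```python
# Apply the recommended headroom to a calculated baseline (illustrative helper).
def with_headroom(cpu_vcpu: float, memory_gb: float) -> tuple[float, float]:
    """30% CPU headroom, 40% memory headroom."""
    return cpu_vcpu * 1.3, memory_gb * 1.4

cpu, mem = with_headroom(10, 12)  # a 10 vCPU / 12 GB baseline
```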
Q9: How do I scale DataStreamer horizontally? When should I add more instances?
DataStreamer is designed for horizontal scaling. You should add more instances when:
CPU utilization consistently exceeds 70% on your existing instances.
You need to process more data than a single instance can handle (typically beyond 100,000 EPS per instance).
You require higher availability and want to distribute the load across multiple instances for redundancy.
Scaling Strategies:
Kubernetes: Increase the replica count in your Deployment. The Horizontal Pod Autoscaler (HPA) can automatically scale based on CPU or custom metrics.
Docker: Increase the replicas count in your docker-compose.yml file.
Bare Metal: Deploy additional DataStreamer instances on separate servers and place a load balancer (NGINX, HAProxy) in front of them.
Load Balancing: For sources that accept connections (like HTTP or socket sources), place a load balancer in front of your DataStreamer instances. For sources that pull data (like file or S3), ensure each instance is configured to read from a different subset of the data.
Q10: Should I run one large DataStreamer instance or multiple smaller ones?
This is a classic trade-off between consolidation and isolation. The answer depends on your priorities:
One Large Instance (Consolidated):
Pros: Simpler to manage, fewer moving parts, more efficient resource utilization.
Cons: Larger blast radius (if it fails, all pipelines go down), harder to scale individual pipelines, potential resource contention.
Best For: Homogeneous workloads, non-critical data, development/testing environments.
Multiple Smaller Instances (Isolated):
Pros: Fault isolation (one failure doesn't affect others), independent scaling, easier to troubleshoot, better for multi-tenancy.
Cons: More operational overhead, potentially less efficient resource usage, more instances to monitor.
Best For: Critical data flows, heterogeneous workloads, production environments with strict SLAs.
Our Recommendation: In Kubernetes, use a balanced approach with 2-4 pipelines per pod. This provides a good balance between operational simplicity and fault isolation. For critical pipelines (e.g., security logs, compliance data), consider dedicated pods with max_pipelines_per_pod: 1.
Q11: How does DataStreamer compare to Fluentd and Logstash in terms of resource usage?
DataStreamer is significantly more efficient than both Fluentd and Logstash. Here's a direct comparison based on independent benchmarks:
| Tool | Throughput | Relative CPU Usage | Relative Memory Usage |
| --- | --- | --- | --- |
| DataStreamer | 25 MiB/s (structured) | 1x (baseline) | 1x (baseline) |
| Fluentd | 15 MiB/s (structured) | 1.5-2.0x | 3-5x |
| Logstash | 10 MiB/s (structured) | 2.0-3.0x | 4-6x |
Why the Difference?
DataStreamer is built in Rust, a modern systems programming language that compiles to a native binary. There is no runtime overhead.
Fluentd is built in Ruby, an interpreted language. It requires the Ruby interpreter to run, which adds significant CPU and memory overhead.
Logstash is built in Java and runs on the JVM (Java Virtual Machine). Logstash typically needs a 4-8 GB JVM heap just to run comfortably, and garbage collection pauses can cause performance issues.
Real-World Impact: For the same workload, DataStreamer can reduce your infrastructure costs by 50-70% compared to Logstash. You can either process the same data with a fraction of the hardware or handle 2-3x more data on your existing infrastructure.
Q12: I have 12 pipelines processing different types of logs. How do I calculate my total resource needs?
To calculate your total resource needs for multiple pipelines, follow this process:
1. Calculate per-pipeline requirements: For each pipeline, determine its EPS, event size, data structure mix, and transformation complexity. Use our sizing formulas to calculate the CPU and memory needs for that specific pipeline.
2. Sum the requirements: Add up the CPU and memory needs across all 12 pipelines. This gives you the total resources required.
3. Apply safety factors: Multiply the total CPU by 1.3 (30% headroom) and total memory by 1.4 (40% headroom).
4. Determine deployment strategy: Based on the total resources, decide how to distribute the pipelines across pods/containers/instances. Use the pod count calculation from our sizing guide.

Example: If your 12 pipelines collectively require 15 vCPU and 20 GB RAM (after safety factors), and you set max_cpu_per_pod: 8 and max_memory_per_pod: 16, you would need at least 2 pods (based on CPU: 15 / 8 = 1.875, round up to 2).
Important: The number of pipelines itself does not directly translate to CPU cores. What matters is the total throughput and transformation complexity across all pipelines.
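The pod-count step of the example can be reproduced in a few lines. The function name is illustrative, not a DataStreamer API; the totals already include safety factors, and the pod count is driven by whichever resource needs more pods:

```python
import math

def pod_count(total_cpu: float, total_mem_gb: float,
              max_cpu_per_pod: float, max_mem_per_pod: float) -> int:
    """Minimum pods so neither per-pod CPU nor memory limit is exceeded."""
    pods_for_cpu = math.ceil(total_cpu / max_cpu_per_pod)     # 15 / 8 -> 2
    pods_for_mem = math.ceil(total_mem_gb / max_mem_per_pod)  # 20 / 16 -> 2
    return max(pods_for_cpu, pods_for_mem)

pods = pod_count(15, 20, 8, 16)  # the example in the text: 2 pods
```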
Q13: Can I use DataStreamer's AI-powered parser to reduce my resource needs?
Yes! DataStreamer's AI-powered parser can significantly reduce the operational burden and, in some cases, the resource overhead of building and maintaining complex parsers.
How It Helps:
Faster Time-to-Value: Build parsers in seconds, not weeks. This means you can start processing data faster without the upfront investment in parser development.
Optimized Parsing: The AI-generated parsers are often more efficient than hand-written Regex patterns because they are designed to extract only the necessary fields.
Reduced Maintenance: As log formats change, you can regenerate the parser quickly instead of manually debugging and updating complex Regex patterns.
Resource Impact: While the AI parser itself doesn't reduce the CPU cost of parsing (parsing is still parsing), it ensures you are not over-parsing (extracting more fields than you need) or using inefficient patterns. This can lead to modest CPU savings (5-15%) compared to poorly optimized manual parsers.
Best Practice: Use the AI parser to quickly prototype and deploy parsers, then monitor their performance. If a particular parser is a bottleneck, you can always optimize it further.
Q14: What are the most common mistakes when sizing DataStreamer deployments?
Based on our experience, here are the most common sizing mistakes:
1. Counting Pipelines Instead of Throughput: Assuming each pipeline needs a dedicated core. This leads to massive over-provisioning. Fix: Use throughput-based sizing.
2. Ignoring Data Structure: Treating all logs the same. Unstructured logs require significantly more CPU than structured logs. Fix: Account for the percentage of unstructured data in your workload.
3. Under-Provisioning Memory: Focusing only on CPU and skimping on memory. This leads to swapping and severe performance degradation. Fix: Follow our memory sizing formulas and apply the 40% safety factor.
4. Not Planning for Spikes: Sizing for average load instead of peak load. Fix: Size for peak load or configure adequate buffers to absorb spikes.
5. Skipping Safety Factors: Deploying with exactly the calculated resources, leaving no headroom. Fix: Always apply the 30% CPU and 40% memory safety factors.
6. Over-Consolidation: Running too many pipelines in a single pod/instance to "save resources." This creates a large blast radius and makes troubleshooting difficult. Fix: Use a balanced approach (2-4 pipelines per pod).
Q15: Where can I find more detailed information on sizing and deployment?
We provide comprehensive documentation to help you size and deploy DataStreamer effectively:
DataStreamer Sizing & Capacity Planning Guide: Step-by-step formulas, real-world scenarios, and detailed calculations.
DataStreamer Production Deployment Guide: Best practices for Kubernetes, Docker, and bare metal deployments.
DataStreamer Performance Benchmark: Comparative analysis vs. Fluentd and Logstash with independent benchmark data.
Resource Calculator (Excel): Interactive calculator where you can input your parameters and get instant sizing recommendations.
All of these resources are available in our documentation portal. If you have specific questions or need assistance with a complex deployment, please contact our support team.
Conclusion
DataStreamer's performance and resource requirements are driven by the complexity of the work you ask it to do, not by arbitrary component counts. By understanding the relationship between data structure, transformation complexity, and throughput, you can accurately size your deployment and achieve optimal performance and cost-efficiency.
The key takeaways are:
Throughput-based sizing is the correct approach.
Data structure (structured vs. unstructured) is the single biggest factor in CPU usage.
Pipelines are logical units, not CPU cores.
Transformation complexity determines resource needs, not just EPS.
Safety factors are critical for production stability.
With this knowledge, you can confidently deploy DataStreamer at any scale.