Robert Szczepanowski
Senior Software Engineer

The Memory of Water: Why LSTMs Demand Polished Data

Jan 14, 2026 · 4 min read

In the era of "Big Data," there is a pervasive myth in environmental science that quantity is a proxy for quality. We assume that if we have terabytes of telemetry logs from thousands of sensors, the sheer volume of information will overpower the noise. We assume that modern Deep Learning architectures—specifically Long Short-Term Memory (LSTM) networks—are smart enough to figure it out.

They are not.

In hydrology, raw data is not fuel; it is crude oil. It is full of impurities, gaps, and artifacts that, if fed directly into a neural network, will clog the engine. When building systems to predict flash floods or manage reservoir levels, the sophistication of your model architecture matters far less than the continuity and physical integrity of your input data.

We don't just need to "clean" data. We need to polish it.

The Illusion of Abundance

A modern hydrological sensor network is a chaotic environment. Pressure transducers drift as sediment builds up. Telemetry radios fail during the very storms we need to measure. Batteries die in the cold.

When you look at a raw dataset, you see a time series. But an LSTM sees a narrative. If that narrative is riddled with holes, spikes, and flatlines, the model cannot learn the underlying physics of the catchment.

We often see teams feed raw sensor logs into training pipelines, hoping the neural network will learn to ignore the errors. This is a fundamental misunderstanding of how LSTMs work. A standard regression model might average out the noise. An LSTM, however, tries to learn the sequence of events. If we feed it noise, it doesn't just make a bad prediction for that timestep; it learns a false causal relationship that corrupts its understanding of future events.

The High Cost of Discontinuity

To understand why data polishing is critical, you have to understand the "Memory" in Long Short-Term Memory.

Unlike a standard feed-forward network that looks at a snapshot of data, an LSTM maintains an internal "cell state"—a vector that carries context forward through time. In hydrology, this cell state represents the physical state of the catchment: How saturated is the soil? How high is the groundwater? Is the river already swollen from yesterday's rain?

Data continuity is the lifeline of this cell state.

When a sensor goes offline for three hours, we don't just lose three hours of data. We sever the model's connection to the past. If we simply drop those rows and stitch the time series back together, we teleport the catchment three hours into the future. The LSTM sees a sudden, inexplicable jump in state that violates the laws of physics.

It tries to learn a pattern to explain this jump. But there is no pattern—only a broken sensor. The result is a model that "hallucinates," predicting sudden floods or droughts based on data artifacts rather than meteorological forcing.
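The effect is easy to demonstrate in a few lines of pandas. This is a minimal sketch with a synthetic 15-minute stage series (values and gap length are invented for illustration): dropping the missing rows produces a single step far larger than any real interval-to-interval change, while reindexing onto the uniform grid keeps the gap visible for downstream handling.

```python
import numpy as np
import pandas as pd

# Hypothetical 15-minute river stage series with a 3-hour telemetry outage.
idx = pd.date_range("2026-01-14 00:00", periods=24, freq="15min")
stage = pd.Series(np.linspace(1.0, 2.0, 24), index=idx)
stage.iloc[8:20] = np.nan  # sensor offline for 12 intervals (3 hours)

# Naive fix: drop the missing rows and stitch the series back together.
# The timesteps now look consecutive, but the state jumps 3 hours at once.
stitched = stage.dropna()
jump = stitched.diff().abs().max()

# Honest alternative: keep the uniform index so the gap stays visible.
gapped = stage.reindex(idx)

print(f"max step after stitching: {jump:.3f} m")
print(f"series length preserved:  {len(gapped)}")
```

The stitched series hands the LSTM a step roughly thirteen times larger than any genuine interval change, with nothing in the inputs to explain it.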

Building the Pipeline with Dagster

To solve this, we cannot rely on ad-hoc cleaning scripts scattered across Jupyter notebooks. We need a rigorous, reproducible engineering standard. This is where we leverage Dagster to orchestrate the transformation from chaos to clarity.

In our architecture, we treat data stages as distinct software-defined assets.

First, we define a raw_sensor_ingestion asset. Dagster pulls this directly from our telemetry APIs or S3 buckets. This asset is immutable; it represents the "ground truth" of what the sensors actually reported, warts and all. We never modify this layer, ensuring we always have a pristine audit trail.

Next, we define a downstream polished_timeseries asset. This is where the engineering happens. Dagster manages the dependency, ensuring that the polishing logic only runs when new raw data is available. Inside this asset, we execute our cleaning algorithms—removing outliers, handling gaps, and normalizing timestamps.

By using Dagster, we gain full lineage. If a model starts behaving strangely, we don't have to guess which cleaning script was run. We can look at the asset graph and see exactly which version of the code produced the training data, ensuring that our "polish" is as version-controlled as our model architecture.

Enforcing the Laws of Physics on Data

The logic inside that polished_timeseries asset is designed to enforce the laws of physics. A neural network starts as a blank slate; it doesn't know that water cannot flow uphill or that a river cannot dry up in seconds.

We must teach it these boundaries through rigorous checks:

  1. Physical Bounds: A river stage cannot be negative. Soil moisture cannot exceed porosity. Precipitation cannot physically reach 500mm in 10 minutes. These aren't just outliers; they are impossibilities.
  2. Temporal Consistency: Water has mass and momentum; it accelerates and decelerates according to gravity and friction. A reading that jumps from 1m to 5m and back to 1m in a single 15-minute interval is almost certainly a sensor glitch, not a flash flood.

If we leave these "ghost signals" in the training set, the LSTM wastes its capacity trying to model impossible physics. By removing them, we allow the model to focus its gradient descent on learning the actual behavior of water.

Filling the Void Without Lying to the Model

Once we identify the gaps and the ghosts, we face the hardest choice in data engineering: Imputation. How do we fill the silence without lying to the model?

This is where domain expertise becomes code.

  • Linear Interpolation might work for temperature, which changes gradually.
  • Forward Filling might work for a reservoir level that changes slowly.
  • Masking is often the most honest approach for precipitation. If we don't know if it rained, we shouldn't guess. We should explicitly tell the model, "I don't know," often by using a separate boolean channel in the input tensor indicating data validity.
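The masking approach amounts to widening the input tensor by one channel per uncertain variable. A minimal sketch, assuming a toy precipitation series and a zero placeholder (both invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy precipitation series with two unknown intervals.
precip = pd.Series([0.0, 2.5, np.nan, np.nan, 1.0, 0.0])

valid = precip.notna().astype(np.float32)        # 1.0 = observed, 0.0 = unknown
filled = precip.fillna(0.0).astype(np.float32)   # neutral placeholder value

# Stack into a (timesteps, channels) array for the LSTM: the model can
# learn to discount the placeholder wherever the validity flag is 0.
features = np.stack([filled.to_numpy(), valid.to_numpy()], axis=1)
print(features.shape)
```

The model is never told the placeholder is a measurement; the validity channel says "I don't know" explicitly, which is exactly the honesty the text argues for.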

The danger of aggressive polishing is creating a "perfect" dataset that doesn't exist in reality. If we smooth out every peak and fill every gap with a perfect average, we train a model that is terrified of extremes. It will under-predict floods because it has never seen the raw, jagged reality of a storm.

Respecting the Journey of the Data

In the rush to adopt the latest Transformer architectures or state-of-the-art LSTMs, it is easy to view data processing as a janitorial task—something to be automated away so we can get to the "real work" of modeling.

But in environmental science, the data is the real work.

The performance ceiling of any hydrological forecast is not determined by the number of layers in your neural network, but by the fidelity of the story your data tells. A simple model trained on polished, physically consistent data will outperform a complex model trained on raw noise every time.

We are not just training models to predict numbers. We are training them to understand the memory of water. And that memory must be clear.