io - How to stream data in python without loading it all into memory at once?

I am trying to write and read to a stream without loading everything into memory at once. Here's what I would imagine working:

import io
stream = io.BytesIO()
def process_stream(stream):
  while True:
    chunk = stream.read(5).decode('utf-8')
    if not chunk:
      return
    yield chunk
# this would be a separate thread, but here we just do it in serial:
for i in range(3):
  stream.write(b'asdf')
for chunk in process_stream(stream):
  print('I read', chunk)
But this actually doesn't print out anything.
I can get it working, but only with the following two changes, either of which requires that all the bytes are held in memory at once:
initializing stream = io.BytesIO(b'asdf' * 3) instead of incrementally writing
using stream.getvalue() instead of incrementally reading
I'm quite baffled that incremental writes can only be read back in one batch, and that incremental reads only work after a batch write. How can I get a constant-memory solution working (assuming process_stream outpaces writing)?
When you write to the stream in the for loop, the stream position (the "seek" cursor) ends up just past the last byte written:
asdfasdfasdf|
            ^ (Seek)            
So when you try to read, there is nothing after the last character, and the read returns an empty result. A solution is to move the position back to the beginning of the stream before reading, using stream.seek(0):
|asdfasdfasdf
^ (Seek after calling stream.seek(0))            
Code:
import io
stream = io.BytesIO()
def process_stream(stream, chunk_size=5):
    while True:
        chunk = stream.read(chunk_size).decode('utf-8')
        if not chunk:
            return
        yield chunk
# this would be a separate thread, but here we just do it in serial:
for i in range(3):
    stream.write(b'asdf')
stream.seek(0) # Reset the seek so it is at the beginning
for chunk in process_stream(stream):
    print('I read', chunk)
Output:
I read asdfa
I read sdfas
I read df
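The cursor movement described above can be checked directly with tell(), which reports the current stream position. This is only a small sketch to illustrate the point; the variable names are made up for the example:

```python
import io

stream = io.BytesIO()
for _ in range(3):
    stream.write(b'asdf')

# After three 4-byte writes the position sits at the end of the data.
pos_after_write = stream.tell()
tail = stream.read()          # nothing left after the cursor

# Rewinding moves the position back to the start, so read() sees everything.
stream.seek(0)
pos_after_seek = stream.tell()
data = stream.read()

print(pos_after_write, tail, pos_after_seek, data)
```

Note that the whole payload is still held inside the BytesIO object; seek(0) only changes where the next read starts.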
More information: How the write(), read() and getvalue() methods of Python io.BytesIO work?
This helps, but I'm still looking for a constant-memory solution. It seems that whenever .write is called, the cursor is moved to the end of the stream again. For instance, if I write 'a', seek 0, read, write 'b', read, I get ''. And if I seek 0 again after writing 'b', I get 'ab'. I'm looking for a solution where the 2nd read just gives 'b', the remaining unread bytes, and 'a' is freed from memory. Is BytesIO just not the right tool?
– mwlon
May 19, 2021 at 19:13
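For what the comment asks (each read returns only the unread bytes, and consumed bytes are released), BytesIO is probably not a good fit, because it keeps one shared cursor and never discards data. One possible approach, sketched here under the assumption that the writer really does run in a separate thread, is to pass chunks through a bounded queue.Queue instead of a BytesIO. The producer function, the _DONE sentinel and the maxsize value are illustrative choices, not part of the original question:

```python
import queue
import threading

_DONE = object()  # hypothetical sentinel marking the end of the stream

def producer(q, n_chunks=3):
    # Stand-in for the writer thread from the question.
    for _ in range(n_chunks):
        q.put(b'asdf')   # blocks while the queue is full
    q.put(_DONE)

def process_stream(q):
    while True:
        chunk = q.get()  # blocks until a chunk is available
        if chunk is _DONE:
            return
        # Assumes each chunk is valid UTF-8 on its own (true for ASCII here).
        yield chunk.decode('utf-8')

# maxsize bounds how many unread chunks can be buffered at any one time.
q = queue.Queue(maxsize=2)
writer = threading.Thread(target=producer, args=(q,))
writer.start()

total_chars = 0
for chunk in process_stream(q):
    print('I read', chunk)
    total_chars += len(chunk)

writer.join()
```

Once a chunk has been taken off the queue and the loop moves on, nothing references it any more, so memory use is bounded by maxsize times the chunk size rather than by the total amount written. The trade-off is that the reader sees the writer's chunk boundaries rather than arbitrary read(5) slices.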
        