Stream remote file via Aleph to AWS S3?

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com, or follow me on Twitter.

Interesting:

What I’ve got to do is get a file stream from a remote HTTP server and then upload that stream into AWS S3.

I query a URL with http/get, parse the Content-Length header to learn the file size, and pass that into the AWS PutObjectRequest as follows:

(aws/put-object
  aws
  {:bucket-name bucket
   :key key
   :input-stream url-stream ;; the body of http/get
   :metadata {:content-length 1418264736} ;; the parsed Content-Length
   :canned-acl :public-read})
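
Pieced together, the whole flow might look something like the sketch below. This assumes the amazonica S3 wrapper (which is what the aws/put-object call above suggests, though the issue never names the library), that Aleph's http/get returns a deferred response whose :body is an InputStream, and that creds, url, bucket, and key are placeholder arguments:

(require '[aleph.http :as http]
         '[amazonica.aws.s3 :as aws])

;; Hypothetical helper: fetch a remote file and stream it straight into S3.
(defn stream-to-s3
  [creds url bucket key]
  ;; Aleph returns a deferred; deref it to get the response map.
  (let [resp   @(http/get url)
        ;; Aleph exposes response headers as lower-case string keys.
        length (Long/parseLong (get-in resp [:headers "content-length"]))]
    (aws/put-object
      creds
      {:bucket-name  bucket
       :key          key
       ;; The body InputStream is consumed as S3 uploads it, which is
       ;; exactly where the length mismatch described below can bite.
       :input-stream (:body resp)
       :metadata     {:content-length length}
       :canned-acl   :public-read})))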

The problem is that, in the middle of the upload, AWS fails with an error like:

com.amazonaws.SdkClientException: Data read has a different length than the expected:
dataLength=193585457; expectedLength=1418264736; includeSkipped=false;
in.getClass()=class com.amazonaws.internal.ReleasableInputStream;
markedSupported=false; marked=0; resetSinceLastMarked=false;
markCount=0; resetCount=0, 

There is a special com.amazonaws.util.LengthCheckInputStream class that counts the number of bytes read from a stream. If that count differs from the provided content-length, it throws an error.
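
To illustrate the idea, here is a rough sketch of that counting behavior as a Clojure proxy over java.io.FilterInputStream. The real LengthCheckInputStream is more thorough (it also handles skip, mark/reset, and the single-byte read); this only overrides the bulk read:

;; Wrap an InputStream and, at EOF, compare the bytes actually read
;; against the declared length, throwing on a mismatch.
(defn length-checking-stream
  [^java.io.InputStream in expected-length]
  (let [counted (atom 0)]
    (proxy [java.io.FilterInputStream] [in]
      (read [buf off len]
        (let [n (.read in ^bytes buf (int off) (int len))]
          (if (neg? n)
            ;; EOF: the byte count must match the declared Content-Length.
            (when (not= @counted expected-length)
              (throw (ex-info "Data read has a different length than the expected"
                              {:dataLength @counted
                               :expectedLength expected-length})))
            (swap! counted + n))
          n)))))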

My guess is that the connection pool closes the HTTP connection before the stream has been read completely. I wonder what would be the best strategy to keep the connection alive?

Right now, I pass a huge idle timeout:

(http/get url {:pool (http/connection-pool
                       {:connection-options
                        {:keep-alive? true
                         ;; one hour, in milliseconds
                         :idle-timeout (* 1000 60 60)}})})

and so far it has been working well. Is there a more correct way to do this?

But in the end he gave up and used clj-http instead.
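
The thread doesn't show his final code, but the clj-http equivalent might look something like this (same placeholder arguments as the sketch above; :as :stream tells clj-http to hand back the body as an unconsumed InputStream rather than reading it all into memory):

(require '[clj-http.client :as client]
         '[amazonica.aws.s3 :as aws])

;; Hypothetical helper: the same flow as before, but over clj-http.
(defn stream-to-s3
  [creds url bucket key]
  (let [resp   (client/get url {:as :stream})
        ;; clj-http response header lookups are case-insensitive.
        length (Long/parseLong (get-in resp [:headers "Content-Length"]))]
    (aws/put-object
      creds
      {:bucket-name  bucket
       :key          key
       :input-stream (:body resp)
       :metadata     {:content-length length}
       :canned-acl   :public-read})))

Presumably this sidesteps the problem because clj-http's underlying Apache HttpClient keeps the connection leased until the body stream is fully consumed, so there is no pool idle timeout racing against the upload.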

Post external references

  1. https://github.com/ztellman/aleph/issues/500