Write-streaming to Google Cloud Storage in Python
Problem description
I am trying to migrate an AWS Lambda function written in Python to CF (Cloud Functions) that
- unzips on-the-fly and reads line-by-line
- performs some light transformations on each line
- writes output (a line at a time or in chunks), uncompressed, to GCS

The output is > 2GB - but slightly less than 3GB, so it fits in Lambda, just.
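For reference, the decompress-and-transform part on its own is simple; a minimal sketch, assuming a gzip-compressed source exposed as a binary file-like object (the names here are illustrative) - the whole question is what sink to stream the output into:

```python
import gzip

def transform(line: str) -> str:
    # stand-in for the light per-line transformation
    return line.upper()

def process(source_fileobj, sink):
    # source_fileobj: any binary file-like object over the gzipped input
    # sink: any writable text file-like object - the missing piece for GCS
    with gzip.open(source_fileobj, mode="rt") as lines:  # decompress on the fly
        for line in lines:                               # read line-by-line
            sink.write(transform(line))                  # write uncompressed output
```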
Well, it seems impossible, or way more involved, in GCP:
- the uncompressed output cannot fit in memory or /tmp - limited to 2048MB as of writing this - so the Python client lib's upload_from_file (or _filename) cannot be used
- there is this official paper, but to my surprise it refers to boto, a library initially designed for AWS S3 and quite outdated now that boto3 has been out for some time; there is no genuine GCP method to stream-write or stream-read
- Node.js has a simple createWriteStream() - nice article here btw - but there is no equivalent one-liner in Python; resumable media upload sounds like it, but it is a lot of code for something handled much more easily in Node
- AppEngine had cloudstorage, but it is not available outside of App Engine - and it is obsolete
- there is little to no example out there of a working wrapper for writing text/plain data line-by-line as if GCS were a local filesystem. This is not limited to Cloud Functions and a lacking feature of the Python client library, but it is more acute in CF due to the resource constraints. Btw, I was part of a discussion to add a writable IOBase function, but it had no traction
- obviously, using a VM or DataFlow is out of the question for the task at hand.
In my mind, stream (or stream-like) reading/writing from cloud-based storage should even be included in the Python standard library.
As recommended back then, one can still use GCSFS, which behind the scenes commits the upload in chunks for you while you are writing stuff to a FileObj. The same team wrote s3fs. I don't know about Azure.
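As a rough sketch of that approach (project, bucket and path are placeholders, and source_lines() stands in for the transform step above), GCSFS exposes a GCS object as a writable file object:

```python
import gcsfs

# project and bucket/path are placeholders; credentials come from the environment
fs = gcsfs.GCSFileSystem(project="my-project")

with fs.open("my-bucket/output/result.txt", "w") as out:
    for line in source_lines():  # stand-in for the transformed-line producer
        out.write(line)
# GCSFS buffers and uploads in chunks behind the scenes, so the full output
# never has to fit in memory or /tmp
```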
AFAIC, I will stick to AWS Lambda as the output can fit in memory - for now - but multipart upload is the way to go to support any output size with a minimum of memory.

Thoughts or alternatives?
I got confused with multipart vs. resumable upload. The latter is what you need for "streaming" - it's actually more like uploading chunks of a buffered stream.

Multipart upload is to load data and custom metadata at once, in the same API call.
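To make the distinction concrete, these are the two flavours at the level of the GCS JSON API (the bucket name is a placeholder; only the uploadType parameter differs):

```python
# Multipart: metadata and data go up together in a single request - fine when
# the whole payload is already in memory
MULTIPART_URL = (
    "https://storage.googleapis.com/upload/storage/v1/b/my-bucket/o"
    "?uploadType=multipart"
)

# Resumable: one request opens an upload session, then the data is sent in
# chunks - this is the "streaming" flavour
RESUMABLE_URL = (
    "https://storage.googleapis.com/upload/storage/v1/b/my-bucket/o"
    "?uploadType=resumable"
)
```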
While I like GCSFS very much - Martin, its main contributor, is very responsive - I recently found an alternative that uses the google-resumable-media library.

GCSFS is built upon the core HTTP API, whereas Seth's solution uses a low-level library maintained by Google, more in sync with API changes, and which includes exponential backoff. The latter is really a must for large/long streams, as the connection may drop even within GCP - we faced the issue with GCF.
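A rough sketch of that pattern with google-resumable-media (bucket, object name and chunk size are placeholders, and the BytesIO buffer stands in for the real stream of transformed lines):

```python
import io

import google.auth
from google.auth.transport.requests import AuthorizedSession
from google.resumable_media.requests import ResumableUpload

credentials, _ = google.auth.default()
transport = AuthorizedSession(credentials)

# Placeholders: bucket, object name, chunk size (a multiple of 256 KB)
upload_url = (
    "https://storage.googleapis.com/upload/storage/v1/b/my-bucket/o"
    "?uploadType=resumable"
)
chunk_size = 1024 * 1024  # 1 MB per request

# Stand-in for the buffer the transformed lines are written to
stream = io.BytesIO(b"line 1\nline 2\n")

upload = ResumableUpload(upload_url, chunk_size)
upload.initiate(
    transport,
    stream,
    metadata={"name": "output/result.txt"},
    content_type="text/plain",
)

# Each call sends at most one chunk; the library handles the retry logic
while not upload.finished:
    upload.transmit_next_chunk(transport)
```

If I read the library right, initiate() also accepts stream_final=False for the case where the total size is not known up front, which is closer to the true streaming scenario.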
On a closing note, I still believe that the Google Cloud Library is the right place to add stream-like functionality, with basic write and read. It has the core code already.
If you too are interested in that feature in the core lib, thumbs up the issue here - assuming priority is based thereon.