roshanrosareo

9 months ago

GCP pipeline for minimum service cost with variable input data volume?

Which pipeline should I use to process JSON messages from Pub/Sub to BigQuery with minimum service cost, variable input data volume, and minimal manual intervention? Based on some online documentation, I believe Dataflow is the service to go with for minimum cost, since its default throughput-based autoscaling handles variable input volume. Please help. I also saw an option for Dataproc with the diagnose command, but I don't think diagnose is used for this purpose.

mattulasien
9 months ago
Dataproc is only meant for workloads that already use Hadoop/Spark; if you are not using Hadoop/Spark, it isn't really an option here. Google's official recommendation for new big data processing pipelines is Dataflow, with Dataproc recommended only if you already use Hadoop/Spark and want to keep using it.

Dataflow only charges for resources while a pipeline is actually running, and it releases them once processing completes, which saves costs compared to a cluster that stays up whether or not it is being utilized. For that reason, Dataflow would be the better choice.
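
For reference, here is a minimal sketch of that kind of streaming pipeline using the Apache Beam Python SDK on the Dataflow runner. The project, region, bucket, topic, and table names below are placeholders, and the BigQuery table is assumed to already exist:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values throughout -- substitute your own project/resources.
options = PipelineOptions(
    streaming=True,                       # Pub/Sub sources require streaming mode
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    enable_streaming_engine=True,         # needed for streaming autoscaling in Python
    autoscaling_algorithm="THROUGHPUT_BASED",
    max_num_workers=5,                    # cap workers to bound cost
)

with beam.Pipeline(options=options) as p:
    (
        p
        # Read raw bytes from the Pub/Sub topic.
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/my-topic")
        # Decode and parse each message as JSON.
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        # Append rows to an existing BigQuery table.
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
    )
```

With throughput-based autoscaling, Dataflow adjusts the worker count to match the incoming message volume, so variable input sizes are handled without manual intervention.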
roshanrosareo
9 months ago
Thanks