We are reading data from an on-prem data lake into cloud storage so that we can use cloud computing for resource-heavy NLP and ML operations (<10 GB total). We are trying to decide whether we need Google BigQuery here or whether we can work directly from Google Cloud Storage with a Dataproc cluster. Any thoughts on which would be the better approach would be appreciated. Thanks!

4 upvotes·17.3K views
Replies (4)
Google BigQuery

BigQuery's storage cost is about the same as Cloud Storage's; the cost comes when you query. If your data is clean and structured, store it directly in BigQuery, which will be much easier. If your data is messy or needs enriching, Dataproc is for you.

2 upvotes·3.2K views
CEO at Fashion Data

Hello, I suggest exporting your data from BigQuery (it's fast and free) into a file format that suits your NLP and ML language. For instance, prefer Avro or Parquet when working with Python.
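As a rough illustration of that export step, here is a minimal Python sketch that assembles a `bq extract` command writing a table to Cloud Storage as Parquet or Avro. The table name `my_dataset.documents`, the bucket path, and the helper `bq_extract_command` are all hypothetical placeholders; the sketch only builds the command and does not run it.

```python
import shlex

# Formats suggested above for working with Python.
SUPPORTED_FORMATS = {"AVRO", "PARQUET"}


def bq_extract_command(table, gcs_prefix, fmt="PARQUET"):
    """Build the argument list for a `bq extract` call (hypothetical helper).

    table      -- table reference such as "my_dataset.documents" (placeholder)
    gcs_prefix -- destination folder such as "gs://my-bucket/exports/documents"
    fmt        -- "PARQUET" or "AVRO"
    """
    fmt = fmt.upper()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    # A single wildcard lets BigQuery shard the export across several files.
    destination = f"{gcs_prefix.rstrip('/')}/part-*.{fmt.lower()}"
    return ["bq", "extract", f"--destination_format={fmt}", table, destination]


if __name__ == "__main__":
    cmd = bq_extract_command("my_dataset.documents", "gs://my-bucket/exports/documents")
    # Print a shell-safe version of the command for review before running it.
    print(shlex.join(cmd))
```

Once the files are in Cloud Storage, a Dataproc job or a local Python process can read them without issuing any further BigQuery queries.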

2 upvotes·1.9K views