Amazon S3

Needs advice on Amazon S3, Dremio, and Snowflake

I'm trying to establish a data lake (or maybe a puddle) for my org's Data Sharing project. The idea is that outside partners would send cuts of their PHI data, regardless of format/variables/systems, to our Data Team, who would then harmonize the data, create data marts, and eventually use it for something. End-to-end, I'm envisioning:

  1. Ingestion -> Secure, role-based, self-service portal for users to upload data (1a. bonus points if it can perform basic validations/masking)
  2. Storage -> Amazon S3 seems like the cheapest. We probably won't need much capacity, even at full scale. Our current storage is a secure Box folder that holds ~4GB of test data batches, code, presentations, and planning docs.
  3. Data Catalog -> AWS Glue? Azure Data Factory? Snowplow? Is the main difference basically the vendor? We will also have Data Dictionaries/Codebooks from submitters. Where would they fit in?
  4. Partitions -> I've seen Cassandra and YARN mentioned, but have no experience with either.
  5. Processing -> We want to use SAS if at all possible. What will work with SAS code?
  6. Pipeline/Automation -> The check-in and verification processes that have been outlined are rather involved. Some sort of automated messaging or approval workflow would be nice.
  7. I have very little guidance on what a "Data Mart" should look like, so I'm going with the idea that it would be another "experimental" partition. Unless there's an actual mart-building paradigm I've missed?
  8. An end user might use the catalog to pull certain de-identified data sets from the marts. Again, role-based access and a self-service GUI would be preferable.

I'm the only full-time tech person on this project, but I'm mostly an OOP, HTML, JavaScript, and some SQL programmer. Most of this is out of my repertoire. I've done a lot of research, but I can't be an effective evangelist without hands-on experience. Since we're starting a new year of our grant, they've finally decided to let me try some stuff out. Any pointers would be appreciated!
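
For the "basic validations/masking" idea in step 1a, here is a minimal sketch of what a check on an uploaded CSV could look like, using only the Python standard library. The column names and the secret key are hypothetical, not something from this project. Keyed hashing only pseudonymizes an identifier; it is unlikely to count as de-identification of PHI on its own, so treat this as a starting point to review with your compliance team.

```python
import csv
import hashlib
import hmac
import io

# Hypothetical column names; a real submission would follow the partner's codebook.
REQUIRED_COLUMNS = {"patient_id", "visit_date", "diagnosis_code"}
MASKED_COLUMNS = {"patient_id"}


def validate_header(fieldnames):
    """Return the sorted list of required columns missing from the header."""
    return sorted(REQUIRED_COLUMNS - set(fieldnames or []))


def mask_value(value, secret_key):
    """Replace an identifier with a keyed hash so rows stay linkable but unreadable."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def mask_csv(text, secret_key):
    """Validate a CSV string and return a copy with the identifier columns masked."""
    reader = csv.DictReader(io.StringIO(text))
    missing = validate_header(reader.fieldnames)
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for column in MASKED_COLUMNS:
            row[column] = mask_value(row[column], secret_key)
        writer.writerow(row)
    return out.getvalue()
```

Calling `mask_csv(uploaded_text, secret_key)` either raises a `ValueError` naming the missing columns or returns the masked CSV text. The same patient ID always maps to the same masked value for a given key, so records can still be joined across batches.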
6 upvotes·80.2K views
Needs advice on Amazon S3 and MySQL

Hi, I'm working on a project to integrate data from Shopify (an e-commerce platform) into Amazon QuickSight. I'm trying to decide which data store to use, either Amazon S3 or MySQL.

3 upvotes·24.7K views
Needs advice on Heroku Postgres and MongoDB

Hi team, we are building an e-commerce marketplace with both sides (sellers and buyers). We have finished the frontend development, but we now face the choice of database. We will use Amazon S3 for cloud storage, but it is hard to choose between a SQL database (Heroku Postgres) and a NoSQL database (MongoDB). For the tech stack, we are using Next.js and Node.js for the frontend and backend. Please share your professional input. Thanks in advance.

9 upvotes·21.8K views
Replies (1)
Full-stack Developer && Software Engineer at Self-employed·
Recommends MongoDB

There are many things to consider when choosing between NoSQL and SQL databases, but in short, SQL databases are often more performant, whereas NoSQL databases are often friendlier for developers to create and maintain.

Since you are building this website from scratch, it would be best to use MongoDB. It makes for an easier developer experience, cuts down on the time needed to set up a database, removes the need for someone specialized in database development (unlike SQL databases, which need to be designed with tables and rows, NoSQL databases are rather intuitive), and is overall a better option for a startup.

One consideration to keep in mind is that switching from a NoSQL database to a SQL database later on is very, very tricky. If you find that you want to improve performance, or you later prefer SQL databases, it's not going to be an easy feat. However, you may come to love NoSQL databases, and your team will appreciate their simplicity as well. Another consideration is cost. I'm not familiar with the expenses involved in using Heroku Postgres, but MongoDB can become a financial overhead when your project gets more user-heavy and you perform more operations and store more data. You may want to perform a quick cost analysis before choosing your preferred database.

I hope this helps you (or anyone else with a similar question in mind) out a bit and good luck with your marketplace!

3 upvotes·5.9K views
Needs advice on Amazon Athena and Amazon DynamoDB

So, I have data in Amazon S3 as Parquet files, and I have it available in the Glue Data Catalog too. I want to build an AppSync API on top of this data. The two options that I am considering are:

  1. Bring the data into Amazon DynamoDB and then build my API on top of this database.

  2. Add a Lambda function that resolves Amazon Athena queries made by AppSync.

Which of the two approaches will be more cost-effective?

I would really appreciate some back of the envelope estimates too.

Note: I only expect to make read queries. Thanks.

5 upvotes·26K views
Replies (2)
Lead Developer at Di-Vision Consultion·
Recommends Amazon DynamoDB

Overall, I would think that if the data fits in Amazon DynamoDB and you can use Query (not Scan), that would be a bit more cost-effective. But it all depends on the size of the data and how often it changes.

On relatively stale data, Athena could be cheaper under big loads when the data is processed via Glue; Lambda costs are quite small. DynamoDB could become expensive under big loads of reads and/or writes.
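
For a back-of-the-envelope comparison, a sketch like the one below can help. It encodes two billing rules: a DynamoDB on-demand read request unit covers a strongly consistent read of up to 4 KB (an eventually consistent read costs half), and Athena bills by data scanned with a 10 MB minimum per query. The prices are parameters, not constants, because they vary by region and change over time. The model ignores DynamoDB storage, Lambda, and AppSync charges, so treat the result as a rough lower bound.

```python
import math

MB_PER_TB = 1024 * 1024


def dynamodb_monthly_read_cost(reads_per_month, item_kb, price_per_million_rru,
                               eventually_consistent=True):
    """Estimate on-demand read cost; one read request unit covers up to 4 KB."""
    units_per_read = max(1, math.ceil(item_kb / 4))
    if eventually_consistent:
        units_per_read /= 2  # eventually consistent reads cost half a unit per 4 KB
    return reads_per_month * units_per_read * price_per_million_rru / 1_000_000


def athena_monthly_query_cost(queries_per_month, mb_scanned_per_query, price_per_tb):
    """Estimate Athena cost; each query is billed for at least 10 MB scanned."""
    billed_mb = max(mb_scanned_per_query, 10)
    return queries_per_month * billed_mb / MB_PER_TB * price_per_tb
```

As an illustration with assumed prices of 0.25 USD per million read request units and 5 USD per TB scanned (check the current pricing pages for your region): one million eventually consistent reads of 2 KB items comes to about 0.13 USD in DynamoDB, while one million Athena queries that each hit the 10 MB minimum come to about 48 USD. The gap narrows or reverses if a single Athena query replaces many DynamoDB reads.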

3 upvotes·2.1K views
Engineering leader at Cyren·

It really depends on the data size and the number of requests. If latency isn't an issue for this use case, then Amazon Athena might be a good solution, provided you partition the data correctly so that queries are efficient. DynamoDB is a key-value database, so it really depends on the use case: you might not be able to retrieve the relevant info.

2 upvotes·2.7K views
Needs advice on Gatsby, Hexo, and WordPress

I have been building a website with Gatsby (for a small group of volunteers). I track it in GitHub and push it to Amazon S3.

I am satisfied with it as a single user; however, I would like non-technical teammates to be able to post Markdown blog posts. I tried to teach them to add mdx files, git push, gatsby build, and publish with gatsby-plugin-s3, but I am getting a fair amount of resistance :).

So I wonder if there are tools, preferably using Node.js, that support multiple blog authors à la WordPress, i.e. with an interface for non-technical bloggers, but producing static/pre-rendered web pages.

(PS: I am considering having a Node/Express.js server where they could upload their mdx file and the server would rebuild, push, and publish for them, without having them install anything, but I'd like to know if something already exists before jumping into this endeavor.)

6 upvotes·39.4K views
Replies (1)
Recommends Gatsby and Netlify CMS

If you're after Markdown I would look at https://www.netlifycms.org. I've used it on several projects to allow clients to use Markdown to publish and it integrates really well with Gatsby. You can create your own content structures using it then implement them into your templates. These are all the widgets you can use: https://www.netlifycms.org/docs/widgets/

This keeps it strictly static file driven with no database or need for express etc.

9 upvotes·1 comment·17.1K views
Arnaud Amzallag · March 31st 2022 at 6:29PM

Thank you, I was skeptical at first, but now that I have read more about it, that is a great answer! Before you answered, I had started down the route of ghost.js to see how the Gatsby build could source from my Ghost server, but that would be an endeavor. Netlify CMS seems to achieve what I wanted much more directly. I will continue to learn about it. Thank you!

Senior Software Engineer at Incode technologies·
Needs advice on ceph and Minio

I need a private storage replacement for Amazon S3. Which one would you choose?

3 upvotes·12.2K views
Technical Architect at ERP Studio·
Needs advice on Amazon DynamoDB and MongoDB

We are developing a system in which we have to collect 10 million records every day. We need a NoSQL database solution. The data is simple logs. We are using AWS for now. I want to know which of the two available options is cheaper: Amazon S3 or MongoDB.

We have 30 tables that are collecting these logs.
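
For a rough sense of scale before comparing prices, the arithmetic can be sketched as below. The 10 million records per day comes from the question; the average record size is an assumption, since the post does not state one.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_writes_per_second(records_per_day):
    """Average sustained write rate; real traffic will have peaks above this."""
    return records_per_day / SECONDS_PER_DAY


def monthly_storage_gb(records_per_day, avg_record_bytes, days=30):
    """Raw data added per month in GB (10**9 bytes), before indexes or compression."""
    return records_per_day * avg_record_bytes * days / 1e9
```

With 10 million records a day, the average works out to roughly 116 writes per second. At an assumed 500 bytes per record, that adds about 150 GB of raw data per month, which is the figure to plug into each vendor's pricing calculator.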

6 upvotes·26.1K views
Replies (4)
Developer ·

The way I'd approach this is to carry out a survey. Prioritise a list of important criteria, such as performance, functionality, and cost. For example, with MongoDB you can archive documents that are not immediately required, which saves on costs at the expense of instant access; if that fits your use case, you can use that feature. So create a test project that actually uses both services as your use case requires, and see the results for yourself. Along the way you'll encounter issues peculiar to each platform that you can factor into your final decision, such as how easy the API is to use, or whether the documentation is sparse or confusing. From there you'll be able to make an informed decision, and you'll be confident investing further resources in it.

5 upvotes·1 comment·26.1K views
reidmorrison · January 9th 2022 at 5:20PM

If you use Amazon DocumentDB instead of DynamoDB, it is compatible with the MongoDB API. That will keep your code cloud-agnostic, and you will have the option of switching between DocumentDB and MongoDB in the future, based on whichever ends up being cheapest to run.

Director - NGO "Informational Culture" / Ambassador - OKFN Russia at Infoculture·

I am a big fan of MongoDB, and it's great for document storage, but I am not really sure that it's the best engine for log storage. If the data that you store is "flat" and well-defined, then log storage based on engines like ClickHouse or the Elasticsearch stack could be much more efficient. It also matters how you reuse the collected logs. Do you calculate aggregated metrics? Do you need full-text search? And so on.

If the logs are really simple and full-text search is needed, then Logstash + Elasticsearch. If you need to calculate a lot of metrics and the logs are not just text but include numbers/values needed for aggregation, then ClickHouse.

5 upvotes·24.8K views
Needs advice on Amazon S3 and CloudFlare

Amazon S3 or CloudFlare as a CDN? Anyone like one over the other? My devs are switching to Amazon S3 but I would like other opinions - thank you in advance! Read a little how they differ, but would love an expert opinion. Thanks!

3 upvotes·182.8K views
Replies (1)
Technical Consultant ·
Recommends Amazon S3

If you use other AWS services, S3 is by far the better choice; they're way easier to integrate. One thing to consider is the cost: AWS prices are higher in some regions than in others.

2 upvotes·3 comments·626 views
Jordan Giha · September 13th 2021 at 11:56PM

Hey Steven, thank you. What would this cost difference be?

Steven Heryanto · September 14th 2021 at 3:06AM

It depends on which region you want to put your S3 bucket in. You can estimate the cost on this page: https://calculator.aws/#/createCalculator/S3

Jordan Giha · September 15th 2021 at 1:36AM

thank you! I wish I knew what to fill out here lol

Needs advice on Airflow and AWS Lambda

I have data stored in an Amazon S3 bucket in Parquet format.

I want this data to be copied from S3 to Amazon Redshift, so I use COPY commands to achieve this. But I have to run them manually. I want to automate this so that whenever a new file arrives in S3, it is copied to the required table in Redshift. Can you suggest what different approaches I can use?

8 upvotes·15.1K views
Replies (1)
Backend Software Engineer ·
Recommends aws, sns, and aws-s3

Hello Aditya, I haven't tried this myself, but theoretically you can use Amazon S3 event notifications to generate an event whenever a Parquet file in S3 is created or changed. First you'd have to generate the event: https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html

Then you would want to subscribe to the event and use the event info to pull the file into Redshift: https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-event-notifications.html#working-with-event-notifications-subscribe
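
As a rough, untested-against-AWS sketch of the second step, a Lambda function subscribed to the S3 notification could turn each new Parquet object into a Redshift COPY statement. The table name and IAM role ARN below are hypothetical placeholders, and actually running the statements (for example through the Redshift Data API) is left out. The sketch also does not escape quotes in object keys.

```python
from urllib.parse import unquote_plus

# Hypothetical names; replace with your own table and IAM role.
TARGET_TABLE = "analytics.events"
IAM_ROLE_ARN = "arn:aws:iam::123456789012:role/redshift-copy-role"


def parquet_objects(event):
    """Yield (bucket, key) for each newly created Parquet object in an S3 event."""
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        if key.endswith(".parquet"):
            yield bucket, key


def build_copy_statement(bucket, key, table=TARGET_TABLE, iam_role=IAM_ROLE_ARN):
    """Build the Redshift COPY statement for one Parquet object."""
    return (
        f"COPY {table} FROM 's3://{bucket}/{key}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS PARQUET;"
    )


def handler(event, context):
    """Lambda entry point: return the COPY statements for the new files."""
    # Executing the statements against Redshift is intentionally not shown here.
    return [build_copy_statement(b, k) for b, k in parquet_objects(event)]
```

Filtering on the `ObjectCreated` event name and the `.parquet` suffix keeps deletes and unrelated files from triggering a load. The same filtering can also be configured on the bucket notification itself.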

-Scott

4 upvotes·3.7K views
Technical Lead at Incred Financial Solutions·
Needs advice on Amazon S3, Metabase, and Presto

Hi,

We are currently storing the data in Amazon S3 in Apache Parquet format. We use Presto to query the data from S3 and catalog it with the AWS Glue catalog. We have Metabase sitting on top of Presto, where our reports live. Presto is becoming too costly for us, and we are looking for alternatives to it, but we want to keep as much of the remaining setup (S3, Metabase) as possible. Please suggest alternative approaches.

6 upvotes·102.2K views
Replies (1)
Co-founder at Transloadit·

Hey there, the trick to keeping costs under control is to partition. This means you split up your source files by date, and also query within date ranges, so that Athena only scans the few files necessary for those dates. I hope that makes sense (and I also hope I understood your question right). This article explains it better: https://aws.amazon.com/blogs/big-data/analyze-your-amazon-cloudfront-access-logs-at-scale/.
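
To make the idea concrete, here is a small sketch of a Hive-style date layout, where each day's files live under a `year=/month=/day=` prefix. The bucket and path are hypothetical. Engines such as Athena can skip whole prefixes when a query filters on the partition columns, which is where the savings come from.

```python
from datetime import date, timedelta


def partition_prefix(base, day):
    """Return the Hive-style prefix that holds one day's files."""
    return f"{base}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


def prefixes_for_range(base, start, end):
    """Return the prefixes a query over [start, end] (inclusive) needs to read."""
    days = (end - start).days
    return [partition_prefix(base, start + timedelta(days=i)) for i in range(days + 1)]
```

For example, `prefixes_for_range("s3://my-bucket/logs", date(2022, 1, 30), date(2022, 2, 1))` returns three prefixes, one per day, instead of the whole `logs/` tree. A query restricted to those dates would only need to scan the files under them.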

4 upvotes·4.8K views