What is Scrapy and what are its top alternatives?
Scrapy is a powerful, flexible open-source web crawling and scraping framework written in Python. It lets developers extract structured data from websites with relatively little code. Scrapy has built-in support for XPath and CSS selectors and for regular expressions, handles cookies, sessions, and authentication, and can parse and extract data from formats such as HTML, XML, and JSON. The main trade-off is the learning curve: using Scrapy effectively requires some Python programming knowledge, which can be a hurdle for beginners.
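To make that concrete, here is a minimal Scrapy spider sketch. It targets quotes.toscrape.com, a public scraping sandbox chosen here for illustration, and shows the two features most people start with: CSS selectors and link following.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """Minimal spider: yields one item per quote and follows pagination."""
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # CSS selectors; response.xpath(...) works the same way
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "Next" link until pagination runs out
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Saved as quotes_spider.py, this runs without a full project scaffold: `scrapy runspider quotes_spider.py -O quotes.json` (the -O output flag is available in recent Scrapy releases).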
- Beautiful Soup: Beautiful Soup is a popular Python library for parsing HTML and XML documents. It provides simple methods for navigating and searching parsed data, making it ideal for beginners (see the sketch after this list). Pros: Easy to use, great for small projects. Cons: It only parses; fetching pages, crawling, and scheduling must be handled by other tools.
- Puppeteer: Puppeteer is a Node.js library that provides a high-level API for controlling headless Chrome or Chromium browsers. It allows for automated web scraping, crawling, and testing. Pros: Supports browser automation, great for dynamic websites. Cons: Requires knowledge of JavaScript.
- Selenium: Selenium is a popular tool for automating web browsers and testing web applications. It supports multiple programming languages and browsers. Pros: Cross-browser support, versatile for testing. Cons: Slower than other tools, can be complex to set up.
- ScrapyRT: ScrapyRT is a lightweight, easy-to-use HTTP API for running Scrapy spiders, which makes it simple to integrate existing spiders into any application. Pros: Seamless integration with Scrapy, easy to deploy. Cons: Limited features compared to the full Scrapy framework.
- PySpider: PySpider is an open-source web crawling and web scraping framework written in Python. It provides a web-based user interface for managing and monitoring spiders. Pros: User-friendly interface, supports distributed crawling. Cons: Limited documentation and community support.
- Apache Nutch: Apache Nutch is a highly extensible and scalable open-source web crawler written in Java. It is widely used for web scraping, text mining, and search engine indexing. Pros: Scalable architecture, supports multiple data formats. Cons: Steeper learning curve, requires Java knowledge.
- ParseHub: ParseHub is a visual web scraping tool that allows users to extract data from websites without any programming knowledge. It offers a point-and-click interface for building scraping workflows. Pros: No coding required, user-friendly interface. Cons: Limited customization options, may not be suitable for complex scraping tasks.
- Octoparse: Octoparse is a desktop application for web scraping that offers both visual operation and advanced data extraction features. It supports scraping data from various websites, including dynamic and JavaScript-heavy sites. Pros: Visual scraping interface, supports cloud extraction. Cons: Limited free version, may have performance issues with large datasets.
- WebHarvy: WebHarvy is a visual web scraper that allows users to easily extract data from web pages using a point-and-click interface. It supports scraping text, images, URLs, and more. Pros: Intuitive interface, supports scheduling and automation. Cons: Limited customization options, may not handle complex data structures well.
- MechanicalSoup: MechanicalSoup is a Python library that provides automated interaction with websites using Python. It simplifies the process of submitting forms, extracting data, and navigating web pages. Pros: Lightweight, easy to use. Cons: Limited functionality compared to full-featured scraping frameworks.
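As a point of comparison with the Scrapy spider above, here is a minimal Beautiful Soup sketch against the same quotes.toscrape.com sandbox. Note that Beautiful Soup only parses; the fetching is done separately, here with the requests library.

```python
import requests
from bs4 import BeautifulSoup

# Fetch the page ourselves; Beautiful Soup does not make HTTP requests
html = requests.get("https://quotes.toscrape.com/").text
soup = BeautifulSoup(html, "html.parser")

# Navigate the parse tree with CSS selectors
for quote in soup.select("div.quote"):
    text = quote.select_one("span.text").get_text()
    author = quote.select_one("small.author").get_text()
    print(f"{text} ({author})")
```

Unlike the Scrapy version, there is no built-in way to follow pagination, retry failures, or throttle requests; that is the automation trade-off noted above.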
Top Alternatives to Scrapy
- Selenium
Selenium automates browsers. That's it! What you do with that power is entirely up to you. Primarily, it is used for automating web applications for testing purposes, but it is certainly not limited to that: boring web-based administration tasks can (and should!) be automated as well.
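As a small illustration in Python (one of the several languages Selenium supports), the sketch below drives Chrome against the JavaScript-rendered variant of the quotes.toscrape.com sandbox. It assumes a Chrome/chromedriver setup that Selenium can locate.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes Chrome and a matching driver are available
try:
    # This variant of the page renders its content client-side with JavaScript
    driver.get("https://quotes.toscrape.com/js/")
    # Wait for the JS-rendered elements to appear before reading them
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.quote"))
    )
    for quote in driver.find_elements(By.CSS_SELECTOR, "div.quote span.text"):
        print(quote.text)
finally:
    driver.quit()
```

The explicit wait is what distinguishes browser automation from plain HTTP scraping: the data does not exist in the page until the browser has executed the site's JavaScript.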
- import.io
import.io is a free web-based platform that puts the power of the machine-readable web in your hands. Using our tools you can create an API or crawl an entire website in a fraction of the time of traditional methods, no coding required.
- BeautifulSoup
Beautiful Soup works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
- Puppeteer
Puppeteer is a Node library which provides a high-level API to control headless Chrome over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome.
- Postman
Postman bills itself as the only complete API development environment, used by nearly five million developers and more than 100,000 companies worldwide.
- Stack Overflow
Stack Overflow is a question and answer site for professional and enthusiast programmers. It's built and run by you as part of the Stack Exchange network of Q&A sites. With your help, we're working together to build a library of detailed answers to every question about programming.
- Google Maps
Create rich applications and stunning visualizations of your data, leveraging the comprehensiveness, accuracy, and usability of Google Maps and a modern web platform that scales as you grow.
Scrapy alternatives & related posts
Pros of Selenium
- Automates browsers (177)
- Testing (154)
- Essential tool for running test automation (101)
- Record-playback (24)
- Remote control (24)
- Data crawling (8)
- Supports end-to-end testing (7)
- Easy setup (6)
- Functional testing (6)
- The most flexible monitoring system (4)
- End-to-end testing (3)
- Easy to integrate with build tools (3)
- Comparing performance, Selenium is faster than jasm (2)
- Record and playback (2)
- Compatible with Python (2)
- Easy to scale (2)
- Integration tests (2)
- Integrated into the Selenium-Jupiter framework (0)
Cons of Selenium
- Flaky tests (8)
- Slow, as it needs to launch a browser (even with no GUI) (4)
- Browser drivers need updating (2)
related Selenium posts
When you think about test automation, it’s crucial to make it everyone’s responsibility (not just QA Engineers'). We started with Selenium and Java, but with our platform revolving around Ruby, Elixir and JavaScript, QA Engineers were left alone to automate tests. Cypress was the answer, as we could switch to JS and simply involve more people from day one. There's a downside too, as it meant testing on Chrome only, but that was "good enough" for us + if really needed we can always cover some specific cases in a different way.
For our digital QA organization to support a complex hybrid monolith/microservice architecture, our team took on the lofty goal of building out a common UI test automation framework. One of the primary requirements was a low technical bar, so that an engineer or analyst with fundamental knowledge of JavaScript could automate their tests with greater ease. Just to list a few of the pieces:
- Nightwatchjs
- Selenium
- Cucumber
- GitHub
- Go.CD
- Docker
- ExpressJS
- React
- PostgreSQL
With this structure, we're able to combine the automation efforts of each team member into a centralized repository while also providing new relevant metrics to business owners.
Pros of import.io
- Easy setup (8)
- Native desktop app (5)
- Free lead generation tool (5)
- Continuous updates (3)
- Features based on user suggestions (3)
related import.io posts
Pros of BeautifulSoup
- Parses HTML even when poorly formed (3)
- It just works (1)
related BeautifulSoup posts
Pros of Puppeteer
- Very well documented (10)
- Scriptable web browser (10)
- Promise-based (6)
Cons of Puppeteer
- Chrome only (10)
related Puppeteer posts
Currently, we are using Protractor in our project. Since Protractor isn't updated anymore, we are looking for a new tool. The strongest suggestions are WebdriverIO or Puppeteer. Please help me figure out what tool would make the transition fastest and easiest. Please note that Protractor uses its own locator system, and we want the switch to be as simple as possible. Thank you!
I work in a company building web apps with AngularJS. I started using Selenium for test automation, as I am more familiar with Python. However, I ran into some difficulties: IDs and fixed lists of classes were often unavailable, so I ended up relying mostly on XPaths, which unfortunately can change with fixes and modifications to the code.
So, I started using Puppeteer, but I am still learning. It seems easier to find elements on the page, even if creating and managing arrays of elements seems a bit more complicated than in Selenium; that could also be due to my poor knowledge of JavaScript.
Any comments on this comparison and also on comparisons with similar tools are welcome! :)
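To make the locator discussion in the post above concrete, here is a small sketch of the two Selenium locator styles being contrasted (XPath vs. CSS), again using the quotes.toscrape.com sandbox for illustration:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://quotes.toscrape.com/")

# XPath locator: expressive, but brittle when the markup changes
by_xpath = driver.find_element(By.XPATH, "//div[@class='quote']//small[@class='author']")

# CSS locator: usually shorter and easier to keep in sync with the page
by_css = driver.find_element(By.CSS_SELECTOR, "div.quote small.author")

print(by_xpath.text, by_css.text)  # both resolve to the same element
driver.quit()
```

Neither style is inherently wrong; the usual advice is to prefer stable attributes (IDs, data-* test hooks) where the application provides them, and fall back to CSS or XPath where it does not.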
Pros of Postman
- Easy to use (490)
- Great tool (369)
- Makes developing REST APIs easy peasy (276)
- Easy setup, looks good (156)
- The best API workflow out there (144)
- It's the best (53)
- History feature (53)
- Adds real value to my workflow (44)
- Great interface that magically predicts your needs (43)
- The best-in-class app (35)
- Can save and share scripts (12)
- Fully featured without looking cluttered (10)
- Collections (8)
- Option to run scripts (8)
- Global/environment variables (8)
- Shareable collections (7)
- Dead simple and useful. Excellent (7)
- Dark theme, easy on the eyes (7)
- Awesome customer support (6)
- Great integration with newman (6)
- Documentation (5)
- Simple (5)
- The test script is useful (5)
- Saves responses (4)
- This has simplified my testing significantly (4)
- Makes testing APIs as easy as 1, 2, 3 (4)
- Easy as pie (4)
- API network (3)
- I'd recommend it to everyone who works with APIs (3)
- Mocking API calls with predefined responses (3)
- Now supports GraphQL (2)
- Postman Runner CI integration (2)
- Easy to set up and test, and provides test storage (2)
- Continuous integration using newman (2)
- Pre-request Script and Test attributes are invaluable (2)
- Runner (2)
- Graph (2)
- Useful tool (1)
Cons of Postman
- Stores credentials in HTTP (10)
- Bloated features and UI (9)
- Cumbersome to switch authentication tokens (8)
- Poor GraphQL support (7)
- Expensive (5)
- Not free after 5 users (3)
- Can't prompt for per-request variables (3)
- Import Swagger (1)
- Support WebSocket (1)
- Import curl (1)
related Postman posts
We just launched the Segment Config API (try it out for yourself here) — a set of public REST APIs that enable you to manage your Segment configuration. A public API is only as good as its #documentation. For the API reference doc we are using Postman.
Postman is an “API development environment”. You download the desktop app, and build API requests by URL and payload. Over time you can build up a set of requests and organize them into a “Postman Collection”. You can generalize a collection with “collection variables”. This allows you to parameterize things like username, password, and workspace_name so a user can fill their own values in before making an API call. This makes it possible to use Postman for one-off API tasks instead of writing code.
Then you can add Markdown content to the entire collection, a folder of related methods, and/or every API method to explain how the APIs work. You can publish a collection and easily share it with a URL.
This turns Postman from a personal #API utility to full-blown public interactive API documentation. The result is a great looking web page with all the API calls, docs and sample requests and responses in one place. Check out the results here.
Postman’s powers don’t end here. You can automate Postman with “test scripts” and have it periodically run collection scripts as “monitors”. We now have #QA around all the APIs in public docs to make sure they are always correct.
Along the way we tried other techniques for documenting APIs like ReadMe.io or Swagger UI. These required a lot of effort to customize.
Writing and maintaining a Postman collection takes some work, but the resulting documentation site, interactivity and API testing tools are well worth it.
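For the kind of API checks the post describes, one common pattern is to run an exported collection through newman, Postman's command-line collection runner, from a CI script. The sketch below is a rough Python wrapper under that assumption; the file names are hypothetical placeholders.

```python
import subprocess

# Hypothetical file names: a collection and environment exported from Postman.
# newman is Postman's CLI collection runner (installed via npm).
result = subprocess.run(
    ["newman", "run", "my_collection.json", "--environment", "my_env.json"],
    capture_output=True,
    text=True,
)
print(result.stdout)

# newman exits non-zero when any test script assertion fails,
# which is what lets CI gate on the collection's tests
if result.returncode != 0:
    raise SystemExit("Postman collection tests failed")
```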
Our whole Node.js backend stack consists of the following tools:
- Lerna as a tool for multi package and multi repository management
- npm as package manager
- NestJS as Node.js framework
- TypeScript as programming language
- ExpressJS as web server
- Swagger UI for visualizing and interacting with the API’s resources
- Postman as a tool for API development
- TypeORM as object relational mapping layer
- JSON Web Token for access token management
The main reason we have chosen Node.js over PHP is related to the following artifacts:
- Made for the web and widely in use: Node.js is a software platform for developing server-side network services. Well-known projects that rely on Node.js include the blogging software Ghost, the project management tool Trello, and the operating system WebOS. Node.js requires the V8 JavaScript runtime, which Google originally developed for the Chrome browser. This makes for a very resource-efficient architecture, which qualifies Node.js especially for running a web server. Ryan Dahl, the creator of Node.js, released the first stable version on May 27, 2009; he developed it out of dissatisfaction with the possibilities JavaScript offered at the time. The core functionality of Node.js has been written in JavaScript since the first version and can be extended with a large number of modules; the current package managers for Node.js (npm or Yarn) index more than 1,000,000 of them.
- Fast server-side solutions: Node.js uses the JavaScript event loop to build non-blocking I/O applications that conveniently serve simultaneous events. With the asynchronous processing available as standard in JavaScript/TypeScript, highly scalable server-side solutions can be built. CPU and RAM are used efficiently, and more simultaneous requests can be processed than with conventional multi-threaded servers.
- One language across the entire stack: Widely used frameworks such as React, AngularJS, or Vue.js (which we prefer) are written in JavaScript/TypeScript. With Node.js on the server side, you gain the advantages of a uniform scripting language throughout the entire application. The same language in the back end and front end simplifies maintenance of the application as well as coordination within the development team.
- Flexibility: Node.js imposes very few strict dependencies, rules, and guidelines, and thus grants a high degree of flexibility in application development. There are no rigid conventions, so the appropriate architecture, design structures, modules, and features can be chosen freely during development.
Pros of Stack Overflow
- Scary smart community (257)
- Knows all (206)
- Voting system (142)
- Good questions (134)
- Good SEO (83)
- Addictive (22)
- Tight focus (14)
- Share and gain knowledge (10)
- Useful (7)
- Fast loading (3)
- Gamification (2)
- Knows everyone (1)
- Experts share experience and answer questions (1)
- Stack Overflow is to developers as Google is to net surfers (1)
- Questions answered quickly (1)
- No annoying ads (1)
- No spam (1)
- Fast community response (1)
- Good moderators (1)
- Quick answers from users (1)
- Good answers (1)
- User reputation ranking (1)
- Efficient answers (1)
- Leading developer community (1)
Cons of Stack Overflow
- Not welcoming to newbies (3)
- Unfair downvoting (3)
- Unfriendly moderators (3)
- No opinion-based questions (3)
- Mean users (3)
- Limited in the types of questions it can accept (2)
related Stack Overflow posts
Google Analytics is a great tool to analyze your traffic. To debug our software and ask questions, we love to use Postman and Stack Overflow. Google Drive helps our team to share documents. We're able to build our great products through the APIs by Google Maps, CloudFlare, Stripe, PayPal, Twilio, Let's Encrypt, and TensorFlow.
Pros of Google Maps
- Free (253)
- Address input through the Maps API (136)
- Shareable directions (82)
- Google Earth (47)
- Unique (46)
- Custom map designing (3)
Cons of Google Maps
- Google attributions and logo (4)
- Only Google's map is allowed alongside Google Places Autocomplete (1)
related Google Maps posts
Google Analytics is a great tool to analyze your traffic. To debug our software and ask questions, we love to use Postman and Stack Overflow. Google Drive helps our team to share documents. We're able to build our great products through the APIs by Google Maps, CloudFlare, Stripe, PayPal, Twilio, Let's Encrypt, and TensorFlow.
A huge component of our product relies on gathering public data about locations of interest, and the Google Places API gives us that ability in the most efficient way. Since we are primarily going to be using Google data as a source of information for our MVP, we might as well start integrating the Google Places API into our system. We have worked with Google Maps in the past, and we might take some inspiration from our previous projects for this one.