What is Spark NLP?
It is a Natural Language Processing library built on top of Apache Spark ML. It provides simple, performant, and accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. It comes with 160+ pretrained pipelines and models in more than 20 languages.
Spark NLP is a tool in the NLP / Sentiment Analysis category of a tech stack.
Spark NLP is an open source tool with 3.3K GitHub stars and 659 GitHub forks.
Who uses Spark NLP?
Companies
5 companies reportedly use Spark NLP in their tech stacks, including Newzera, Ukuli Data, and Rabbitique.
Developers
20 developers on StackShare have stated that they use Spark NLP.
Spark NLP's Features
- Tokenization
- Stop Words Removal
- Normalizer
- Stemmer
- Lemmatizer
- NGrams
- Regex Matching
- Text Matching
- Chunking
- Date Matcher
- Part-of-speech tagging
- Sentence Detector
- Dependency parsing (Labeled/unlabeled)
- Sentiment Detection (ML models)
- Spell Checker (ML and DL models)
- Word Embeddings (GloVe and Word2Vec)
- BERT Embeddings
- ELMo Embeddings
- Universal Sentence Encoder
- Sentence Embeddings
- Chunk Embeddings
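To make a few of the features above concrete, here is a minimal plain-Python sketch of what the Tokenization, Normalizer, Stop Words Removal, and NGrams stages compute when chained together. It does not use Spark NLP's API: the function names, the regular expressions, and the tiny stop-word list are all assumptions made for illustration, and a real Spark NLP pipeline would run equivalent stages over a distributed DataFrame.

```python
import re

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "is", "on", "of", "and"}


def tokenize(text):
    """Split raw text into word tokens (runs of letters, digits, apostrophes)."""
    return re.findall(r"[A-Za-z0-9']+", text)


def normalize(tokens):
    """Lowercase each token and strip anything that is not a letter or digit."""
    cleaned = (re.sub(r"[^a-z0-9]", "", token.lower()) for token in tokens)
    return [token for token in cleaned if token]


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in stop_words]


def ngrams(tokens, n=2):
    """Build space-joined n-grams from consecutive tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def run_pipeline(text, n=2):
    """Chain the stages in order, keeping each intermediate result."""
    tokens = tokenize(text)
    normalized = normalize(tokens)
    filtered = remove_stop_words(normalized)
    return {
        "tokens": tokens,
        "normalized": normalized,
        "filtered": filtered,
        "ngrams": ngrams(filtered, n),
    }


if __name__ == "__main__":
    result = run_pipeline("Spark NLP is built on top of Apache Spark ML.")
    for stage, output in result.items():
        print(stage, output)
```

Each stage consumes the previous stage's output, which mirrors the general idea of an annotation pipeline in which later annotators depend on earlier ones.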
Spark NLP Alternatives & Comparisons
What are some alternatives to Spark NLP?
SpaCy
It is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products. It comes with pre-trained statistical models and word vectors, and currently supports tokenization for 49+ languages.
rasa NLU
rasa NLU (Natural Language Understanding) is a tool for intent classification and entity extraction. You can think of rasa NLU as a set of high-level APIs for building your own language parser using existing NLP and ML libraries.
Transformers
It provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch.
Gensim
It is a Python library for topic modelling, document indexing, and similarity retrieval with large corpora. Its target audience is the natural language processing (NLP) and information retrieval (IR) community.
Amazon Comprehend
Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to discover insights from text. Amazon Comprehend provides Keyphrase Extraction, Sentiment Analysis, Entity Recognition, Topic Modeling, and Language Detection APIs so you can easily integrate natural language processing into your applications.