
    A Tour of Python NLP Libraries


Image generated with DALL·E 3

     

NLP, or Natural Language Processing, is a field within Artificial Intelligence that focuses on the interaction between human language and computers. It aims to explore and apply text data so that computers can understand text in a meaningful way.

As research in the NLP field has progressed, the way we process text data on computers has evolved. These days, we use Python to help explore and process data easily.

With Python becoming the go-to language for working with text data, many libraries have been developed specifically for the NLP field. In this article, we will explore several incredible and useful NLP libraries.

    So, let’s get into it.
     

    NLTK

     
NLTK, or Natural Language Toolkit, is an NLP Python library with many text-processing APIs and industrial-grade wrappers. It’s one of the largest NLP Python libraries, used by researchers, data scientists, engineers, and others, and it has become a standard library for NLP tasks.

Let’s try to explore what NLTK can do. First, we need to install the library with the following code:
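
pip install nltk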

     

Once installed, we can see what NLTK can do. First, NLTK can perform tokenization using the following code:

import nltk
from nltk.tokenize import word_tokenize

# Download the required resources
nltk.download('punkt')

text = "The fruit in the table is a banana"
tokens = word_tokenize(text)

print(tokens)
    

     

    Output>> 
    ['The', 'fruit', 'in', 'the', 'table', 'is', 'a', 'banana']
    

     

Tokenization basically splits a sentence into individual words, or tokens.
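
NLTK can also tokenize text at the sentence level with sent_tokenize, which relies on the same punkt resource downloaded above. A minimal sketch (the sample text is our own):

from nltk.tokenize import sent_tokenize

text = "The fruit in the table is a banana. It looks ripe."
print(sent_tokenize(text))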

With NLTK, we can also perform Part-of-Speech (POS) tagging on the text sample.

from nltk.tag import pos_tag

nltk.download('averaged_perceptron_tagger')

text = "The fruit in the table is a banana"
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)

print(pos_tags)
    

     

    Output>>
    [('The', 'DT'), ('fruit', 'NN'), ('in', 'IN'), ('the', 'DT'), ('table', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('banana', 'NN')]
    

     

The output of the POS tagger with NLTK is each token paired with its predicted POS tag. For example, the word ‘fruit’ is a noun (NN), and the word ‘a’ is a determiner (DT).

It’s also possible to perform stemming and lemmatization with NLTK. Stemming reduces a word to its base form by cutting off prefixes and suffixes, while lemmatization transforms a word to its base form by taking the word’s POS and morphological analysis into account.

from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')
nltk.download('punkt')

text = "The striped bats are hanging on their feet for best"
tokens = word_tokenize(text)

# Stemming
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]
print("Stems:", stems)

# Lemmatization (defaults to noun POS)
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(token) for token in tokens]
print("Lemmas:", lemmas)
    

     

    Output>> 
    Stems: ['the', 'stripe', 'bat', 'are', 'hang', 'on', 'their', 'feet', 'for', 'best']
    Lemmas: ['The', 'striped', 'bat', 'are', 'hanging', 'on', 'their', 'foot', 'for', 'best']
    

     

You can see that the stemming and lemmatization processes produce slightly different results for some of the words.
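
Note that WordNetLemmatizer treats every token as a noun by default, which is why "hanging" was left unchanged above. Passing a POS hint changes the result; a minimal sketch:

# Lemmatize as a verb instead of the default noun; prints 'hang'
print(lemmatizer.lemmatize("hanging", pos="v"))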

That’s the simple usage of NLTK. You can still do many things with it, but the APIs above are the most commonly used.
     

    SpaCy

     
SpaCy is an NLP Python library designed specifically for production use. It’s an advanced library known for its performance and its ability to handle large amounts of text data, making it a preferred choice for industry use in many NLP cases.

To install SpaCy, check the usage page of its documentation. Depending on your requirements, there are many combinations to choose from.
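
For example, a common combination (assuming a CPU-only setup and the small English pipeline used in the examples below) would be:

pip install -U spacy
python -m spacy download en_core_web_sm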

Let’s try using SpaCy for NLP tasks. First, we will perform Named Entity Recognition (NER) with the library. NER is the process of identifying and classifying named entities in text into predefined categories, such as person, address, location, and more.

    import spacy
    
    nlp = spacy.load("en_core_web_sm")
    
text = "Brad is working in the U.K. Startup called AIForLife for 7 Months."
doc = nlp(text)
# Perform the NER
for ent in doc.ents:
    print(ent.text, ent.label_)
    

     

    Output>>
    Brad PERSON
the U.K. Startup ORG
    7 Months DATE
    

     

As you can see, the SpaCy pre-trained model understands which words within the document can be classified as entities.

Next, we can use SpaCy to perform dependency parsing and visualize the result. Dependency parsing is the process of understanding how each word relates to the others by forming a tree structure.

    import spacy
    from spacy import displacy
    
    nlp = spacy.load("en_core_web_sm")
    
text = "Brad is working in the U.K. Startup called AIForLife for 7 Months."
doc = nlp(text)
    for token in doc:
        print(f"{token.text}: {token.dep_}, {token.head.text}")
    
    displacy.render(doc, jupyter=True)
    

     

    Output>> 
    Brad: nsubj, working
    is: aux, working
    working: ROOT, working
    in: prep, working
    the: det, Startup
U.K.: compound, Startup
Startup: pobj, in
called: advcl, working
AIForLife: oprd, called
for: prep, called
    7: nummod, Months
    Months: pobj, for
    .: punct, working
    

     

The output includes every word together with its dependency relation and its head word. The code above will also render a tree visualization in your Jupyter Notebook.
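
If you run the code as a plain Python script rather than in a notebook, displacy.serve is an alternative that starts a local web server for the same visualization. A minimal sketch:

from spacy import displacy

# Serves the dependency tree at http://localhost:5000 by default
displacy.serve(doc, style="dep")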

Finally, let’s try performing text similarity with SpaCy. Text similarity measures how similar or related two pieces of text are. There are many techniques and metrics, but we will try the simplest one.

    import spacy
    
    nlp = spacy.load("en_core_web_sm")
    
    doc1 = nlp("I like pizza")
    doc2 = nlp("I love hamburger")
    
    # Calculate similarity
    similarity = doc1.similarity(doc2)
    print("Similarity:", similarity)
    

     

    Output>>
    Similarity: 0.6159097609586724
    

     

The similarity method scores the similarity between texts, usually between 0 and 1. The closer the score is to 1, the more similar the two texts are.
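
One caveat: en_core_web_sm ships without true word vectors, so spaCy will warn that its similarity scores may not be meaningful. A pipeline with vectors, such as en_core_web_md, usually gives better results. A sketch, assuming you have run python -m spacy download en_core_web_md:

import spacy

nlp = spacy.load("en_core_web_md")  # medium pipeline with word vectors
print(nlp("I like pizza").similarity(nlp("I love hamburger")))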

There are still many things you can do with SpaCy. Explore the documentation to find something useful for your work.
     

    TextBlob

     
TextBlob is an NLP Python library for processing textual data, built on top of NLTK. It simplifies much of NLTK’s usage and can streamline text-processing tasks.

You can install TextBlob using the following code:

pip install -U textblob
    python -m textblob.download_corpora
    

     

First, let’s try using TextBlob for NLP tasks. We will start with sentiment analysis, which we can do with the code below.

    from textblob import TextBlob
    
text = "I am in the top of the world"
blob = TextBlob(text)
    sentiment = blob.sentiment
    
    print(sentiment)
    

     

    Output>>
    Sentiment(polarity=0.5, subjectivity=0.5)
    

     

The output is a polarity and a subjectivity score. Polarity is the sentiment of the text, with scores ranging from -1 (negative) to 1 (positive). The subjectivity score ranges from 0 (objective) to 1 (subjective).
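
You can also read the fields individually, for example to turn the polarity into a simple label. A sketch (the zero cutoff is our own arbitrary choice, not a TextBlob convention):

polarity = blob.sentiment.polarity
label = "positive" if polarity > 0 else "negative" if polarity < 0 else "neutral"
print(label, polarity)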

We can also use TextBlob for text correction tasks, with the following code.

    from textblob import TextBlob
    
text = "I havv goood speling."
blob = TextBlob(text)

# Spelling Correction
corrected_blob = blob.correct()
    print("Corrected Text:", corrected_blob)
    

     

    Output>>
Corrected Text: I have good spelling.
    

     

Try exploring the TextBlob package to find APIs for your text tasks.
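
For example, noun phrase extraction is built in (a quick sketch; the sample sentence is our own):

from textblob import TextBlob

blob = TextBlob("TextBlob is a simple Python library for text processing.")
print(blob.noun_phrases)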
     

    Gensim

     
Gensim is an open-source Python NLP library that specializes in topic modeling and document similarity analysis, especially for large and streaming data. It focuses more on industrial real-time applications.

Let’s try the library. First, we can install it using the following code:
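
pip install gensim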

     

After the installation is finished, we can try Gensim’s capabilities. Let’s do topic modeling with LDA using Gensim.

import gensim
from gensim import corpora
from gensim.models import LdaModel

# Sample documents
documents = [
    "Tennis is my favorite sport to play.",
    "Football is a popular competition in certain country.",
    "There are many athletes currently training for the olympic."
]

# Preprocess documents
texts = [[word for word in document.lower().split()] for document in documents]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# The LDA model
lda_model = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=15)

topics = lda_model.print_topics()
for topic in topics:
    print(topic)
    

     

    Output>>
    (0, '0.073*"there" + 0.073*"currently" + 0.073*"olympic." + 0.073*"the" + 0.073*"athletes" + 0.073*"for" + 0.073*"training" + 0.073*"many" + 0.073*"are" + 0.025*"is"')
    (1, '0.094*"is" + 0.057*"football" + 0.057*"certain" + 0.057*"popular" + 0.057*"a" + 0.057*"competition" + 0.057*"country." + 0.057*"in" + 0.057*"favorite" + 0.057*"tennis"')
    

     

The output is a combination of words from the sample documents that cohesively form a topic. You can evaluate whether or not the result makes sense.
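
Once trained, the model can also infer the topic mixture of an unseen document by reusing the dictionary from above. A minimal sketch (the example sentence is made up):

new_doc = "Athletes play tennis and football."
new_bow = dictionary.doc2bow(new_doc.lower().split())

# Probability distribution over the two trained topics
print(lda_model.get_document_topics(new_bow))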

Gensim also provides a way for users to embed content. For example, we can use Word2Vec to create embeddings from words.

import gensim
from gensim.models import Word2Vec

# Sample sentences
sentences = [
    ['machine', 'learning'],
    ['deep', 'learning', 'models'],
    ['natural', 'language', 'processing']
]

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=20, window=5, min_count=1, workers=4)

vector = model.wv['machine']
print(vector)
    

     

    
    Output>>
    [ 0.01174188 -0.02259516  0.04194366 -0.04929082  0.0338232   0.01457208
     -0.02466416  0.02199094 -0.00869787  0.03355692  0.04982425 -0.02181222
     -0.00299669 -0.02847819  0.01925411  0.01393313  0.03445538  0.03050548
      0.04769249  0.04636709]
    

     

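The trained model can also rank vocabulary words by vector similarity. A minimal sketch (with a toy corpus this small, the neighbors are essentially random):

# Nearest neighbors of 'machine' in the embedding space
print(model.wv.most_similar('machine', topn=2))
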
There are still many features you can use with Gensim. Try checking the documentation and evaluating your needs.
     

    Conclusion

     

In this article, we explored several Python NLP libraries essential for many text tasks. All of these libraries can be useful in your work, from text tokenization to word embeddings. The libraries we discussed are:

    1. NLTK
    2. SpaCy
    3. TextBlob
    4. Gensim

I hope it helps!
     
     

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.
