RAKE in Power BI

1 minute read

I like Python but it needs more acronyms

In this post I am going to go through the basic use of Rapid Automatic Keyword Extraction (RAKE) using the Natural Language Tool Kit (NLTK) as documented here.

So before we get into RAKE what is NLTK? It is a set of tools that do some quite similar stuff to Natural Language Processing as used in the spaCy Position of Speech (POS) and Named Entity Recognition (NER) tools that are explored here. Essentially the tools break down text into their component parts for scripts like RAKE, POS, NER to return results based on analysis of those components.

What is RAKE? This tool extracts key words and phrases based on frequency and without any understanding of the context of the text being analysed. This makes it quite a general tool that will not work well with all text sources but like POS and NER is a good exploratory tool and may work well when joined up to other information about the text data. As an example I have found that using Named Entity Labels and RAKE outputs works well in Power BI.

I have not included the same degree of comments as in previous posts because the code here is pretty much the same so please refer to this post to understand the context of what each step is doing. These examples are also still based on the same assumption of the data used as in previous posts i.e. two columns “id” and “text”.

RAKE Example for Jupyter

    import pandas as pd
    import nltk
    from rake_nltk import Rake
    df = pd.read_csv('../templates/data/Text Small Example.csv')
    data = []
    for idx, row in df.iterrows():
        if not isinstance(row['text'], str):
            continue
        rake = Rake()
        rake.extract_keywords_from_text(row['text'])
        #doc = nlp(row['text'])
        for score, phrase in rake.get_ranked_phrases_with_scores():
            row_data = (row["id"],phrase,score)
            data.append(row_data)
    new_df = pd.DataFrame(data, columns=['id','phrase','score'])
    print(new_df)

RAKE Example for Power BI

    import pandas as pd
    import nltk
    from rake_nltk import Rake
    data = []
    for idx, row in dataset.iterrows():
        if not isinstance(row['text'], str):
            continue
        rake = Rake()
        rake.extract_keywords_from_text(row['text'])
        #doc = nlp(row['text'])
        for score, phrase in rake.get_ranked_phrases_with_scores():
            row_data = (row["id"],phrase,score)
            data.append(row_data)
    new_df = pd.DataFrame(data, columns=['id','phrase','score'])

Hopefully it is clear from the above example that not much needs to change when using Python in Power BI. As mentioned in my previous post a change of mindset to think of data frames as the outputs rather than print, visuals or export.