How we created the wordcloud for our research page
Posted on Mon 08 January 2024 in tutorials
The wordcloud found on our research page was generated by parsing the text of our research papers. Here's how we did it.
Downloading the Papers
First, we need to download the papers. For this, we used the Python library crossref-commons to get a list of our papers. Here is a simple example of how to do this:
from crossref_commons.iteration import iterate_publications_as_json

filter = {'orcid': orcid}  # your ORCID iD
queries = {'query.author': authorname}  # your author name

publications_metadata = []
for entry in iterate_publications_as_json(max_results=1000, queries=queries, filter=filter):
    metadata = {}
    metadata["doi"] = entry["DOI"]
    metadata["year"] = entry["created"]["date-parts"][0][0]
    month = entry["created"]["date-parts"][0][1]
    day = entry["created"]["date-parts"][0][2]
    metadata["date"] = f"{metadata['year']}-{month}-{day}"
    metadata["type"] = "paper" if entry["type"] == "journal-article" else entry["type"]
    # "container-title" can be present but empty, so check the value, not just the key
    metadata["journal"] = entry["container-title"][0] if entry.get("container-title") else ""
    ...
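Once the loop finishes, it can help to cache the collected metadata so later runs don't have to query the Crossref API again. A minimal sketch (the file name papers_metadata.json is just an illustrative choice, not part of our actual pipeline):

```python
import json
from pathlib import Path

# Hypothetical cache file, used only for illustration
cache_file = Path("papers_metadata.json")

# A sample of what one collected metadata entry looks like
publications_metadata = [
    {"doi": "10.1000/example", "year": 2023, "date": "2023-5-1",
     "type": "paper", "journal": "Example Journal"},
]

# Write the collected metadata to disk ...
cache_file.write_text(json.dumps(publications_metadata, indent=2))

# ... and read it back later without hitting the API again
restored = json.loads(cache_file.read_text())
```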
Extracting Text from the Papers
Next, we extract the text from each downloaded PDF using the pymupdf
library. Here is a basic example of how to extract text from a PDF:
import fitz  # this is pymupdf
from pathlib import Path

papers_dir = Path(".../pdfs")
output_dir = Path(".../txts")  # where the extracted text files go

for paper in papers_dir.glob("*.pdf"):
    text_file_path = (output_dir / paper.stem).with_suffix(".txt")
    print(f"Converting {paper}")
    page_txt = []
    try:
        doc = fitz.open(paper)
        for page in doc:
            page_txt.append(page.get_text())
    except Exception as exc:
        # Skip PDFs that pymupdf cannot open or parse
        print(f"Skipping {paper}: {exc}")
        continue
    text = "\n----!@#$NewPage!@#$----\n".join(page_txt)
    with open(text_file_path, "w") as f:
        f.write(text)
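The deliberately unusual separator string makes it easy to recover per-page text later. For example, splitting a converted file back into its pages is a one-liner (a small sketch, assuming the same marker as above):

```python
PAGE_MARKER = "\n----!@#$NewPage!@#$----\n"

# Example text as it would appear in one of the converted .txt files
text = PAGE_MARKER.join(["page one text", "page two text", "page three text"])

# Splitting on the marker restores the original page boundaries
pages = text.split(PAGE_MARKER)
print(len(pages))  # → 3
```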
Generating the Wordcloud
Finally, we use the wordcloud
library to generate the wordcloud image from the extracted text. Here is a simple example:
from wordcloud import WordCloud
from pathlib import Path

txt_path = Path('.../txts')
merged_txt = ''
for txt_file in txt_path.glob('*.txt'):
    with open(txt_file, 'r') as f:
        merged_txt += f.read() + '\n'

english_stopwords = ["some", "stopwords", "for", "english", "text", "..."]

# Create a word cloud
wordcloud = WordCloud(width=800, height=800, stopwords=english_stopwords).generate(merged_txt)

# Save the word cloud as an image
output_file = 'wordcloud.png'  # choose where to save the image
wordcloud.to_file(output_file)
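Under the hood, the word sizes in the image come from simple word frequencies. If you want to inspect those counts yourself before generating the image, a plain-Python sketch using collections.Counter (not part of the wordcloud API; the sample text and stopword set here are made up for illustration) looks like this:

```python
from collections import Counter
import re

merged_txt = "Deep learning models. Learning models learn."  # stand-in for the merged paper text
english_stopwords = {"the", "a", "of"}  # same idea as the stopword list above

# Tokenize, lowercase, drop stopwords, then count
words = [w for w in re.findall(r"[a-zA-Z']+", merged_txt.lower())
         if w not in english_stopwords]
counts = Counter(words)

print(counts.most_common(2))  # → [('learning', 2), ('models', 2)]
```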
And that's it! With these steps, you can generate a wordcloud image from a collection of research papers. This can be a useful tool for quickly visualizing the most frequently used words in your research.