How we created the wordcloud for our research page
Posted on Mon 08 January 2024 in tutorials
The wordcloud found on our research page was generated by parsing the text of our research papers. Here's how we did it.
Downloading the Papers
First, we need to download the papers. For this, we used the Python library crossref-commons to get a list of our papers. Here is a simple example of how to do this:
from crossref_commons.iteration import iterate_publications_as_json

filter = {'orcid': orcid}  # your ORCID iD
queries = {'query.author': authorname}  # your author name

publications_metadata = []
for entry in iterate_publications_as_json(max_results=1000, queries=queries, filter=filter):
    metadata = {}
    metadata["doi"] = entry["DOI"]
    metadata["year"] = entry["created"]["date-parts"][0][0]
    month = entry["created"]["date-parts"][0][1]
    day = entry["created"]["date-parts"][0][2]
    metadata["date"] = f"{metadata['year']}-{month}-{day}"
    metadata["type"] = "paper" if entry["type"] == "journal-article" else entry["type"]
    # "container-title" can be present but empty, so check the value, not just the key
    metadata["journal"] = entry["container-title"][0] if entry.get("container-title") else ""
    ...
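Once the loop finishes, it can help to cache the collected metadata so later runs don't have to query the Crossref API again. A minimal sketch (the file name papers_metadata.json is just an illustrative choice, not part of our actual pipeline):

```python
import json
from pathlib import Path

# Hypothetical cache file, used only for illustration
cache_file = Path("papers_metadata.json")

# A sample of what one collected metadata entry looks like
publications_metadata = [
    {"doi": "10.1000/example", "year": 2023, "date": "2023-5-1",
     "type": "paper", "journal": "Example Journal"},
]

# Write the collected metadata to disk ...
cache_file.write_text(json.dumps(publications_metadata, indent=2))

# ... and read it back later without hitting the API again
restored = json.loads(cache_file.read_text())
```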
Extracting Text from the Papers
Next, we extract the text from each downloaded PDF using the pymupdf
library. Here is a basic example of how to extract text from a PDF:
import fitz  # this is pymupdf
from pathlib import Path

papers_dir = Path(".../pdfs")
output_dir = Path(".../txts")  # where the extracted text files go

for paper in papers_dir.glob("*.pdf"):
    text_file_path = (output_dir / paper.stem).with_suffix(".txt")
    print(f"Converting {paper}")
    page_txt = []
    try:
        doc = fitz.open(paper)
        for page in doc:
            page_txt.append(page.get_text())
    except Exception as exc:
        # Skip PDFs that pymupdf cannot open or parse
        print(f"Skipping {paper}: {exc}")
        continue
    text = "\n----!@#$NewPage!@#$----\n".join(page_txt)
    with open(text_file_path, "w") as f:
        f.write(text)
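The deliberately unusual separator string makes it easy to recover per-page text later. For example, splitting a converted file back into its pages is a one-liner (a small sketch, assuming the same marker as above):

```python
PAGE_MARKER = "\n----!@#$NewPage!@#$----\n"

# Example text as it would appear in one of the converted .txt files
text = PAGE_MARKER.join(["page one text", "page two text", "page three text"])

# Splitting on the marker restores the original page boundaries
pages = text.split(PAGE_MARKER)
print(len(pages))  # → 3
```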
Generating the Wordcloud
Finally, we use the wordcloud
library to generate the wordcloud image from the extracted text. Here is a simple example:
from wordcloud import WordCloud
from pathlib import Path

txt_path = Path('.../txts')
merged_txt = ''
for txt_file in txt_path.glob('*.txt'):
    with open(txt_file, 'r') as f:
        merged_txt += f.read() + '\n'

english_stopwords = ["some", "stopwords", "for", "english", "text", "..."]

# Create a word cloud
wordcloud = WordCloud(width=800, height=800, stopwords=english_stopwords).generate(merged_txt)

# Save the word cloud as an image
output_file = 'wordcloud.png'  # choose where to save the image
wordcloud.to_file(output_file)
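Under the hood, the word sizes in the image come from simple word frequencies. If you want to inspect those counts yourself before generating the image, a plain-Python sketch using collections.Counter (not part of the wordcloud API; the sample text and stopword set here are made up for illustration) looks like this:

```python
from collections import Counter
import re

merged_txt = "Deep learning models. Learning models learn."  # stand-in for the merged paper text
english_stopwords = {"the", "a", "of"}  # same idea as the stopword list above

# Tokenize, lowercase, drop stopwords, then count
words = [w for w in re.findall(r"[a-zA-Z']+", merged_txt.lower())
         if w not in english_stopwords]
counts = Counter(words)

print(counts.most_common(2))  # → [('learning', 2), ('models', 2)]
```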
And that's it! With these steps, you can generate a wordcloud image from a collection of research papers. This can be a useful tool for quickly visualizing the most frequently used words in your research.