2024 Week 2
Second week! As an update, I finished the embeddings of the entire dataset (800,000 different medical abstracts) from the trialstreamer dump, after long nights of running experiments with the A100 and using PyTorch tricks like making the batch size smaller :)
I was able to also obtain the top 10 abstracts for each claim in the current dataset of 80 claims, filtering by top 10 cosine similarity indices in the dataset with the query (claim) embedding. Yay! A taste of retrieval.
The next steps include visualizing the data by converting it to a json so that I can view it with an awesome Annotation viewer Sebastian made. I think I need to then analyze at least 20 claims and their documents to assess the relevance of the abstracts, and then gauge a sense compared to the previous baseline of using DPR and ROBERTA embeddings.
I also will have to reach out to Somin (author of RedHOT paper) about his code for the DPR baseline, and perhaps try running it myself.
Cheers to more experiments and a new week!