i have a transcript of a chat between two individuals spanning 6 years and nearly 200,000 messages, with an average length of 50 tokens per message. i have received a legal subpoena to produce the chat messages that are responsive to a specific set of criteria (e.g., having to do with discrimination and/or treatment at a specific medical facility). the chat data is of a personal nature and most of it is not responsive to the subpoena. i would like to have an llm analyze the chat messages and identify the responsive messages. i have access to datacenter gpu compute instances. what would be efficient ways to approach this task for maximum accuracy?
my budget for this project, not including labor, is $100. i need the llm output as soon as possible, preferably within a day.
the source is signal chat backups; they have been decrypted, resulting in sqlite databases of messages & metadata and individual binary attachment files of chat media (images, videos, voice memos, pdfs, other files). the databases have the message timestamps and senders. i do not know how the media files are linked to their proper places in the chat, if at all.
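to find out whether the attachments can be linked back to messages, a first step might be dumping the schema of the decrypted databases. this is a minimal sketch using python's built-in sqlite3 module; the `describe_schema` helper name is made up for illustration, and it does not assume any particular signal table or column names:

```python
import sqlite3

def describe_schema(conn):
    """Return {table_name: [column_name, ...]} for every table in the database."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    schema = {}
    for table in tables:
        # Quote the identifier so unusual table names do not break the PRAGMA.
        quoted = '"' + table.replace('"', '""') + '"'
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        cols = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
        schema[table] = [c[1] for c in cols]
    return schema
```

calling it on a connection opened with `sqlite3.connect(...)` against one of the decrypted databases would list every table and its columns; a column in an attachment-like table that looks like a message id would suggest how the media map to the chat.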
for ease of processing, the messages have been combined into a single csv file with message, timestamp, and sender, but no attachments are included. most of the attachments are probably non-responsive; if you think they should be processed, that can be done after the initial production.
here is the complete subpoena criteria:
All messages and media related to:
• Plaintiff Jennifer Capasso's treatment at Memorial Sloan Kettering Cancer Center (“MSK”).
• Any complaint and/or response to such complaint made to any MSK staff member, personnel, patient representative, or any other employee at MSK related to Plaintiff Jennifer Capasso.
• Requests to update Plaintiff Jennifer Capasso's patient information at MSK, including requests to update her pronouns or gender identity markers.
• The gender markers used to identify Plaintiff Jennifer Capasso at other hospitals where she received medical care.
• Prior instances of discrimination that Plaintiff Jennifer Capasso has experienced based on her gender identity in any setting.
• Plaintiff Jennifer Capasso’s March 7, 2022 surgery at MSK.
• Any emotional distress, mental anguish, pain and suffering, and/or other economic and noneconomic loss resulting from Plaintiff Jennifer Capasso’s treatment at MSK.
The methodology should attempt to balance recall and precision, but err on the side of recall/over-inclusion.
The methodology should use a hybrid approach to determine responsiveness, combining semantic analysis, embedding comparisons & keyword inclusion.
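One way the three signals could be combined is a simple union: flag a chunk if any signal fires, which leans toward over-inclusion. The sketch below is only an illustration of that idea. The function names and both thresholds are placeholders that would need tuning on a labeled sample, and the embedding vectors and LLM score are assumed to be computed elsewhere.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_decision(keyword_hit, embed_sim, llm_score,
                    embed_threshold=0.5, llm_threshold=0.3):
    """Flag a chunk if ANY signal fires (union), which favors recall over precision.

    keyword_hit: bool from the keyword pass
    embed_sim:   cosine similarity to the closest subpoena criterion
    llm_score:   the model's 0-1 responsiveness score
    Both thresholds are placeholder values, not recommendations.
    """
    signals = {
        "keyword": bool(keyword_hit),
        "embedding": embed_sim >= embed_threshold,
        "llm": llm_score >= llm_threshold,
    }
    fired = [name for name, hit in signals.items() if hit]
    return {"flag": bool(fired), "signals": fired}
```

Recording which signals fired for each chunk would also give the reviewer and the methodology documentation a per-chunk audit trail.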
The methodology should include text normalization and keyword expansion.
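As a rough illustration of what normalization and keyword expansion could look like, here is a standard-library sketch. The `EXPANSIONS` entries are illustrative seed terms loosely drawn from the subpoena wording, not a vetted list, and the helper names are hypothetical.

```python
import re
import unicodedata

# Illustrative seed terms only (hypothetical, not a vetted list).
EXPANSIONS = {
    "msk": ["memorial sloan kettering", "sloan kettering"],
    "surgery": ["operation", "procedure"],
    "pronouns": ["misgendered", "gender marker"],
}

def normalize(text):
    """Unicode-normalize, straighten curly quotes, lowercase, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text.lower()).strip()

def expand_keywords(expansions):
    """Flatten seed terms and their variants into one normalized set."""
    terms = set()
    for seed, variants in expansions.items():
        terms.add(normalize(seed))
        terms.update(normalize(v) for v in variants)
    return terms

def keyword_hits(message, terms):
    """Return the sorted terms that appear in the message as whole words or phrases."""
    norm = normalize(message)
    return sorted(t for t in terms
                  if re.search(r"\b" + re.escape(t) + r"\b", norm))
```

For example, `keyword_hits("Called Sloan  Kettering about the Surgery", expand_keywords(EXPANSIONS))` matches both "sloan kettering" and "surgery" despite the casing and the doubled space. Misspellings and slang would still need their own variants in the list.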
The results will be manually reviewed by myself and then by my attorney. We will need a way to delete or redact entire messages or parts thereof that are non-responsive.
cloud/iaas/paas/third party apis can be used if data is not potentially shared with law enforcement or otherwise retained. the gpu compute instances are on services like runpod and vast.ai.
legal counsel has approved this approach as long as the methodology is defensible and documented (methodology documentation is needed) and the results are human-verifiable.
the method will need to analyze chunks of chat and not lines in isolation. some topics appear in the chat and then are not mentioned again until dozens or hundreds of messages later; i am not sure if we need some means of preserving that context.
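for the chunking part, one common option is overlapping sliding windows over the message index. a minimal sketch, where `make_windows` and the default `size` and `overlap` values are placeholders of mine rather than tested settings:

```python
def make_windows(n_messages, size=40, overlap=10):
    """Return (start, end) index pairs, end exclusive, covering 0..n_messages.

    Consecutive windows share `overlap` messages, so an exchange no longer
    than the overlap that straddles a boundary still lands whole in one window.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    start = 0
    while start < n_messages:
        end = min(start + size, n_messages)
        windows.append((start, end))
        if end == n_messages:
            break  # the last window already reaches the end of the chat
        start += step
    return windows
```

overlap only helps with local context. it does not address topics that resurface hundreds of messages later, which is the part i am unsure about.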
the llm step(s) are the first pass; only a sample of the messages deemed non-responsive will be human-reviewed for confirmation.
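for the confirmation sample, a seeded random draw would make the selection reproducible and easy to document. a small sketch, where the helper name and the default seed are arbitrary choices for illustration:

```python
import random

def sample_for_review(nonresponsive_ids, k, seed=0):
    """Draw a reproducible simple random sample of message ids for human review."""
    ids = sorted(nonresponsive_ids)   # fixed order so the seed fully determines the draw
    rng = random.Random(seed)         # local RNG; leaves the global random state untouched
    return sorted(rng.sample(ids, min(k, len(ids))))
```

recording the seed and sample size in the methodology document would let anyone regenerate exactly the same sample later.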
the results should be in spreadsheet format, one row per message, with line number, original message content, responsiveness score, reasoning, and 2-5 messages around responsive messages. the reasoning can be for blocks of messages rather than individual messages.
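to make the requested layout concrete, here is a sketch that builds one row per message and marks the neighbors of responsive messages as context. the column names, the `status` column, and the `threshold` and `radius` defaults are assumptions for illustration only:

```python
import csv
import io

def build_rows(messages, scores, reasons, threshold=0.5, radius=3):
    """One row per message; neighbors within `radius` of a responsive message are marked as context."""
    responsive = {i for i, s in enumerate(scores) if s >= threshold}
    context = set()
    for i in responsive:
        lo, hi = max(0, i - radius), min(len(messages), i + radius + 1)
        context.update(range(lo, hi))
    context -= responsive  # a responsive message is never downgraded to context
    rows = []
    for i, text in enumerate(messages):
        status = "responsive" if i in responsive else "context" if i in context else ""
        rows.append({"line": i + 1, "message": text, "score": scores[i],
                     "reasoning": reasons[i], "status": status})
    return rows

def to_csv(rows):
    """Serialize the rows to CSV text that opens directly in a spreadsheet."""
    fields = ["line", "message", "score", "reasoning", "status"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

block-level reasoning could simply be repeated in the `reasoning` cell of every message in the block.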
must meet time & budget constraints and minimize labor
the subpoena applies to the entire chat timeline. i have authority for both parties in the chat; the other party is the plaintiff in the case.
either party's messages may be responsive, and both parties share similar privacy concerns.
large models (70b+) can be used and llm fine-tuning can be considered. testing and training datasets can be created.