As an important component of medical research, clinical trials require thoughtful and deliberate identification of potential participants. Clinical trial eligibility depends on many factors, including past health conditions, previous treatments received, age, and stage of the disease. Due to the complexity of the underlying disease process and the associated clinical trial, identifying eligible candidates can require hours of manual review.

For this project, we set out to build a decision support tool using publicly available data to flag eligibility for a melanoma cancer trial based on short-form clinical statements that include types of previous interventions. Data preparation resulted in a set of approximately 20,000 unlabeled statements loaded into Snorkel Flow to help label the documents programmatically. It would have taken almost three months to label these clinical statements manually, but with Snorkel’s help, we could easily build ground truth and label them in less than a week.

What is Ground Truth and Why is it Important?

The starting point for programmatic labeling is a portion of the data with ground truth labels. A domain expert usually labels this subset, and its quality is critical to building models that generalize and perform well.

The process of generating this subset is often lengthy, manual, and cumbersome. Although the ground truth is typically a small fraction of the entire dataset, ideally less than 5%, care must be taken to get these labels right. This set serves as the basis for generating labeling functions (LFs), training language models, and deploying a solution. Snorkel Flow as a platform enables a more straightforward labeling process for experts. It can easily capture various signal sources for labeling through keywords, patterns, database lookups, and external models. To do this, LFs are iteratively generated and refined. These functions are based on the ground truth and are then applied to the entire document set.

What are Snorkel Flow Labeling Functions (LFs)?

Snorkel AI describes labeling functions (LFs) as the fundamental building blocks of programmatic labeling that encode diverse sources of weak labeling signals to produce high-quality labeled data at scale. More simply, you can think of LFs as generalized rules created to assist Snorkel Flow in programmatically labeling your unlabeled data. These functions can be based on the inclusion or exclusion of specific keywords. LFs can also be based on phrase proximity to other phrases. For our particular use case, these rules help Snorkel Flow determine whether a certain document will be eligible for the melanoma trial (1) or not (0).  
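To make the idea of a rule concrete, here is a minimal sketch of a keyword LF and a proximity LF in plain Python. It does not use the Snorkel Flow API; the function names, the ABSTAIN value of -1, and the example phrases are assumptions chosen for illustration.

```python
import re

# Assumed label convention: 1 = eligible, 0 = ineligible, -1 = no opinion.
ELIGIBLE, INELIGIBLE, ABSTAIN = 1, 0, -1

def lf_keyword(text, keywords, label):
    """Vote `label` if any keyword appears in the text; otherwise abstain."""
    lowered = text.lower()
    return label if any(k in lowered for k in keywords) else ABSTAIN

def lf_proximity(text, first, second, max_gap, label):
    """Vote `label` if `second` follows `first` with at most `max_gap` words between."""
    pattern = (
        rf"\b{re.escape(first)}\b(?:\W+\w+){{0,{max_gap}}}?"
        rf"\W+{re.escape(second)}\b"
    )
    return label if re.search(pattern, text, flags=re.IGNORECASE) else ABSTAIN

# Hypothetical statements, not taken from the project data.
print(lf_keyword("History of skin cancer", ["skin cancer"], ELIGIBLE))
print(lf_proximity("no prior systemic chemotherapy", "prior", "chemotherapy", 2, INELIGIBLE))
```

Returning an abstain value when nothing matches keeps a rule from voting on documents it knows nothing about, which is what lets many narrow rules be combined later.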

How do Labeling Functions (LFs) Work?

When reviewing LFs and their performance, it is vital to understand how the LF was built and what the function tells you. A matched word or pattern may be a direct reason for the class, but it does not need to be. For example, in our data, ‘situ cervix’ is often mentioned in documents that are classified as eligible, but it is most often found in conjunction with ‘exclusion criteria’; the presence of ‘situ cervix’ itself is not the reason why the patient is eligible for the melanoma trial.

Figure 1 shows a simple example of an LF that looks for the phrases ‘skin cancer,’ ‘carcinoma situ,’ and ‘situ cervix.’ If any of these phrases is present, Snorkel recommends Class 1, or eligible for trial.

The quality of an LF is determined by precision and coverage. Precision indicates how accurately our LF classifies a given clinical statement. Coverage tells us how generalizable the LF is across the entire corpus, or in other words, how applicable it is. Both metrics are key to assessing the quality of an LF. In this example, the LF is highly precise and applies to ~11.5% of all documents in our corpus. Highly precise LFs with good coverage can classify documents accurately and help build better-performing models for generalized predictions.

Figure 1. A labeling function that searches for ‘skin cancer,’ ‘carcinoma situ,’ and ‘situ cervix’ in the condition field with its corresponding precision and coverage
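The two metrics can be computed by hand on a small sample. The sketch below re-creates the Figure 1 rule in plain Python and scores it on a handful of invented statements; the ABSTAIN value, the function names, and the toy data are assumptions, not Snorkel Flow's implementation.

```python
ABSTAIN = -1  # assumed convention for "this LF has no opinion"

def lf_skin_terms(statement):
    """Vote eligible (1) when any of the Figure 1 phrases is present."""
    phrases = ("skin cancer", "carcinoma situ", "situ cervix")
    lowered = statement.lower()
    return 1 if any(p in lowered for p in phrases) else ABSTAIN

def coverage(votes):
    """Fraction of all documents on which the LF casts a vote."""
    return sum(v != ABSTAIN for v in votes) / len(votes)

def precision(votes, gold):
    """Among labeled documents the LF votes on, fraction matching ground truth."""
    scored = [(v, g) for v, g in zip(votes, gold) if v != ABSTAIN and g is not None]
    if not scored:
        return float("nan")
    return sum(v == g for v, g in scored) / len(scored)

# Invented toy statements; None marks a document without a ground truth label.
statements = [
    "history of skin cancer",
    "carcinoma situ cervix exclusion criteria",
    "skin cancer diagnosed after enrollment",
    "no prior malignancy",
    "performance status zero",
]
gold = [1, 1, 0, 0, None]
votes = [lf_skin_terms(s) for s in statements]
print(f"coverage={coverage(votes):.2f} precision={precision(votes, gold):.2f}")
```

Note that coverage is measured over the whole corpus, while precision can only be measured on the documents that carry ground truth labels.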

Optimizing Labeling Functions and Utilizing Snorkel Flow Features

One option to increase the efficiency of a series of LFs is to base one single LF on a user-defined dictionary or a list of keywords. As mentioned previously, pattern recognition through Regex is also available in addition to using keywords to create LFs.
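As a rough illustration of both options, the sketch below builds one LF from a whole term list and another from a regular expression. The factory functions are hypothetical helpers written in plain Python, and the staging regex is an invented example rather than a rule from the project.

```python
import re

ABSTAIN = -1  # assumed convention for "no vote"

def make_dictionary_lf(terms, label):
    """Build a single LF that fires when any term from the list is present."""
    pattern = re.compile(
        "|".join(rf"\b{re.escape(t)}\b" for t in terms), re.IGNORECASE
    )
    return lambda text: label if pattern.search(text) else ABSTAIN

def make_regex_lf(regex, label):
    """Build an LF that fires when the regular expression matches."""
    compiled = re.compile(regex, re.IGNORECASE)
    return lambda text: label if compiled.search(text) else ABSTAIN

# One LF replaces three separate keyword LFs.
skin_terms_lf = make_dictionary_lf(["skin cancer", "carcinoma situ", "situ cervix"], 1)
# Hypothetical pattern: matches "stage I" through "stage IV".
stage_lf = make_regex_lf(r"\bstage\s+(?:i{1,3}|iv)\b", 1)

print(skin_terms_lf("Basal cell SKIN CANCER"), stage_lf("Stage IV melanoma"))
```

Keeping the term list separate from the rule means a domain expert can extend the dictionary without touching the LF itself.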

During development, high precision (>80%) and moderate-to-high coverage (10-20%) were targeted for an LF to be included in the model.
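A screening step like this can be expressed in a few lines. The sketch below is an illustration only: the LFStats record, the LF names, and the numbers are invented, and treating 10% as a coverage floor is our reading of the target range above.

```python
from dataclasses import dataclass

@dataclass
class LFStats:
    name: str
    precision: float  # fraction between 0 and 1
    coverage: float   # fraction between 0 and 1

def select_lfs(stats, min_precision=0.80, min_coverage=0.10):
    """Keep LFs with precision above the target and coverage at or above the floor."""
    return [
        s.name
        for s in stats
        if s.precision > min_precision and s.coverage >= min_coverage
    ]

# Invented numbers for three hypothetical LFs.
candidates = [
    LFStats("lf_skin_terms", precision=0.95, coverage=0.115),
    LFStats("lf_rare_phrase", precision=0.99, coverage=0.02),
    LFStats("lf_broad_keyword", precision=0.60, coverage=0.35),
]
print(select_lfs(candidates))
```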

Tradeoffs between higher precision with potentially lower coverage and higher coverage with potentially lower precision will naturally occur based on your use case. The choice to create additional LFs or to refine existing ones will depend on the time available to invest and the impact on overall performance.

Snorkel Flow has a built-in way to improve the collection of LFs iteratively. It provides features such as the Fast Model, a quick logistic regression model that lets users evaluate different LFs, and the Clarity Matrix, a recommendation table that provides pointers on the accuracy of our LFs. Both are discussed in more depth in the subsequent sections. Snorkel Flow is an end-to-end platform that lets you train and deploy models, and the Application Studio is where this end-to-end workflow comes together.

Applying Data Transformations Using Operators in the Application Studio

Snorkel Flow provides a way to process your data using operators. Operators are low-code or no-code building blocks that help you create end-to-end applications seamlessly. Depending on the data type you are analyzing (e.g., text-based documents or conversational chat), you can commit different nodes to the application studio in a directed, acyclic manner.

Figure 2. The cancer trial use case consisted of an EmbeddingFeaturizer operator committed before the model node
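One way to picture the directed, acyclic arrangement is as a dependency graph that is resolved into an execution order. The sketch below uses Python's standard graphlib module (Python 3.9+); the node names are hypothetical, and this is not how Snorkel Flow represents its pipelines internally.

```python
from graphlib import CycleError, TopologicalSorter

def execution_order(dependencies):
    """Return node names in an order that respects every dependency."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # The second element of args holds the nodes that form the cycle.
        raise ValueError(f"pipeline is not acyclic: {err.args[1]}") from err

# Each node maps to the set of nodes that must run before it (hypothetical names).
pipeline = {
    "whitespace_cleanup": set(),
    "embedding_featurizer": {"whitespace_cleanup"},
    "model": {"embedding_featurizer"},
    "prediction_filter": {"model"},
}
print(execution_order(pipeline))
```

Because the graph has no cycles, every node can run once all of its upstream nodes have finished, which is what makes pre- and post-processing steps composable around the model.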

The model node is committed to the application studio by default. Operators can be applied flexibly before or after this node. You can perform a pre-processing step like tokenization and whitespace removal and apply a post-processing operator which filters results based on the model prediction generated. Here are several operators available in Snorkel:

Text Extraction

  •  Scans and extracts word spans from text-based documents, including PDFs, based on the maximum token length provided.

Noun Builder

  •  Calls spaCy to generate noun phrases (groups of words that function as a noun) for a given sentence portion (e.g., ‘the big plane’ is a noun phrase in the sentence ‘the big plane is flying over the mountain’). The operator iterates over the base noun phrases in a document. Users can define a regex span to restrict the operator to the sentence portion that matches the regular expression.

Numerical and Ordinal Transformer

  •  Converts numbers to words, both numerical (e.g., 1 to one) and ordinal values (e.g., 1st to first).

Embedding is an approach in which sentences and documents are represented as high-dimensional numeric vectors. In our use case, as seen in Figure 2, an operator called EmbeddingFeaturizer is utilized before the model node.

Figure 3. Parameters passed into the EmbeddingFeaturizer operator, which includes text column, target column, and embedding model to use

Users first specify the column they want to transform (the field), which in our example is [condition]. Snorkel Flow then generates a new column, the target field [condition_embedding], which holds the transformed text as vectors. For our use case, the condition column contains short-form clinical statements. The operator converts each of these statements to a numeric vector to enhance Snorkel’s ability to work with text data efficiently. By default, Snorkel offers two methods to generate embeddings, SimCSE and spaCy. It is possible to bring in your own embedding models via custom operators, which requires Snorkel’s Python SDK.
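The field-to-target-field transformation can be sketched without any embedding library. In the example below, a hashed bag-of-words vector stands in for a learned sentence embedding such as SimCSE, so the vectors carry far less meaning than real embeddings; the function names and the 16-dimension size are assumptions made for illustration.

```python
import math
import zlib

def toy_embedding(text, dim=16):
    """Hashed bag-of-words vector, scaled to unit length (stand-in for a real embedding)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # crc32 gives a stable bucket per token across runs.
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def add_embedding_column(rows, field="condition", target_field="condition_embedding"):
    """Return copies of the rows with a new vector column derived from `field`."""
    return [{**row, target_field: toy_embedding(row[field])} for row in rows]

def cosine(a, b):
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))

rows = add_embedding_column([
    {"condition": "history of skin cancer"},
    {"condition": "skin cancer history"},
])
print(cosine(rows[0]["condition_embedding"], rows[1]["condition_embedding"]))
```

Once every statement is a vector, similarity between documents becomes a simple numeric comparison, which is what clustering views build on.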

Since each observation is now represented as a vector, Cluster view in the studio visually represents the documents, and Snorkel automatically forms clusters to tie documents together.

Figure 4. Snorkel generates clusters of documents based on the EmbeddingFeaturizer tool

The Autoclusters feature groups texts based on the similarity and frequency of n-grams appearing in different documents. For example, in Figure 4, Snorkel automatically recognizes ‘skin diagnosis,’ ‘performance status,’ and ‘status zero’ as n-grams that commonly appear in documents of a specific class. What is neat is that this clustering procedure is done behind the scenes. Snorkel instantly recommends a potential cluster LF based on the grouping observed.

Similarly, by manually highlighting a set of points, user-defined clusters can also be generated. Snorkel generates a set of n-grams recommendations based on the collection of documents highlighted and presents the option to apply this new information as an LF. Not only does this simplify the labeling procedure and classification model pipeline, but this also provides flexibility in helping users navigate their data from different angles.

Figure 5. A cluster LF created based on a manually drawn bounding region
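The n-gram recommendations for a group of documents can be approximated with a simple document-frequency count. The sketch below is a rough stand-in under our own assumptions (whitespace tokenization, bigrams, invented statements); the article does not describe how Snorkel ranks n-grams internally.

```python
from collections import Counter

def top_ngrams(documents, n=2, k=3):
    """Rank n-grams by how many documents in the group contain them."""
    doc_freq = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        # A set ensures each n-gram counts once per document.
        grams = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        doc_freq.update(grams)
    return doc_freq.most_common(k)

# Invented statements standing in for a highlighted cluster.
cluster = [
    "ecog performance status zero",
    "performance status zero required",
    "performance status one",
]
print(top_ngrams(cluster, n=2, k=2))
```

The highest-ranked phrases are natural candidates for a keyword LF covering that cluster.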

Training Models in Snorkel Studio

After creating LFs for each class with acceptable coverage and precision, Snorkel programmatically labels as many data points as possible. Once the data is labeled, Snorkel trains a model to classify the medical notes into one of two classes: eligible (1) or ineligible (0) for this melanoma clinical trial. There are three ways to create a model from the Snorkel Studio:

Custom Model

  •  A full, custom model that can be exported and put into production. In addition to creating your own custom model, Snorkel also provides a set of built-in open-source models based on the task type (e.g., classification or sequence tagging).

Fast Model

  •  A streamlined logistic regression model with the goal of helping users iterate quickly to obtain a good set of LFs


AutoML

  •  A training suite allowing users to find a well-performing model using grid search with automated hyperparameter tuning, automated scanning of various models, and automatic feature field selection
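Before any of these models can be trained, the votes from the individual LFs have to be combined into one training label per document. The sketch below uses a plain majority vote as a deliberately simple stand-in; this article does not describe how Snorkel Flow combines votes, and the ABSTAIN value and function name are assumptions.

```python
from collections import Counter

ABSTAIN = -1  # assumed convention for "no vote"

def majority_vote(votes):
    """Return the most common non-abstain vote, or ABSTAIN on no votes or a tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

# Each row holds the votes of three hypothetical LFs for one document.
lf_votes = [[1, 1, ABSTAIN], [0, ABSTAIN, ABSTAIN], [1, 0, ABSTAIN], [ABSTAIN] * 3]
print([majority_vote(row) for row in lf_votes])
```

Documents that end up with ABSTAIN stay unlabeled, which is why LF coverage directly limits how much training data is available.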

Before deploying a model, experimentation with various techniques can help to generate a solid classification model. In this use case, the initial model type was the Fast Model, which is used to quickly gauge how the LFs perform.

This model type allows us to modify or add LFs as needed and ensures the swift labeling of documents. This alone saves time, as it would likely take medical professionals significantly longer to go through each document, one by one. Since the Fast Model is designed to live up to its name by producing results rapidly, various iterations of LFs were evaluated on the model’s accuracy and F1 score to determine the optimal set of LFs to include. Once the performance is at an acceptable level, a deployable production model can be created using the Custom Model and/or AutoML tools.
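For readers who want to reproduce this kind of comparison outside the platform, the two metrics are straightforward to compute. The sketch below scores two hypothetical LF packages on invented predictions; the package names and data are illustrative and are not the results from this project.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented ground truth and predictions for two hypothetical LF packages.
y_true = [1, 1, 1, 0, 0, 0]
predictions = {
    "package_a": [1, 1, 0, 0, 0, 1],
    "package_b": [1, 1, 1, 0, 0, 1],
}
scores = {name: f1_score(y_true, pred) for name, pred in predictions.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3), round(accuracy(y_true, predictions[best]), 3))
```

F1 is worth tracking alongside accuracy because eligibility classes are often imbalanced, and accuracy alone can look good while eligible patients are missed.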

Figure 6. Left: Fast Model F1 score comparison with different LF packages. Right: Clarity Matrix of Model 3 showing areas on how to improve the model

As shown in Figure 6, LF package version 4 delivered the best accuracy and F1 score. This set of 18 LFs was utilized for training the full model with the Custom Model option. Snorkel offers various built-in models like XGBoost using TF-IDF, memory-optimized logistic regression, and BERT with classifier fine-tuning. The list of models varies based on the application type you create. If you have your own custom model, you can register it through the SDK, and it will then also appear in the list.

Aimpoint Digital has the expertise to help you integrate custom models into your workflow. Our NLP and domain experts are ready to assist you in developing large language models for use cases ranging from text extraction to sequence tagging.

Figure 7. The built-in models available in Snorkel are application-specific 

Snorkel also has multiple parameters which users can tune to improve their model. Users can find configurable mechanisms to prevent overfitting in the Train Options tab. Beyond their importance in model maintenance, these options help achieve the right model complexity with a balanced bias-variance tradeoff.

All trained models can be viewed in the Trained Models section in the Overview. As seen in Figure 8, in addition to the model architecture employed (e.g., XGBoost, Logistic Regression), Snorkel also includes information such as the accuracy and F1 score obtained for each split and the training set utilized to generate the labels. Since the training set will most likely change as LFs are tweaked, Snorkel shows a warning symbol to indicate that the metrics shown in the list are stale and that those models must be re-trained. Once a model reaches the desired metrics, it must be committed before it can be used in a deployment (as seen in purple below).

Figure 8. A list of all models trained can be seen from the Trained Models in the application overview section

Deploying the Application

To further facilitate the end-to-end model design, Snorkel offers built-in methods to deploy your applications. Users can compile their pipeline into a Snorkel Flow Package format in which they must provide the expected input columns to make prediction requests.

Figure 9. Deploying the use case application as a Snorkel Flow Package

Once deployed, navigate to the packaged application on the Deployments page to send prediction requests through Snorkel’s sandbox environment. Users can make either single or batch requests, bearing in mind that runtime varies with the number of requests and the complexity of the pipeline (i.e., the number of operators committed). For each request, provide the required fields listed when the application was deployed. The request output is a downloadable JSON file that includes the predicted class and the predicted probabilities of each class. In addition, since the EmbeddingFeaturizer node is included in our flow, the JSON file also returns the vector representation of the documents, as seen in Figure 10.

Figure 10. Users provide a payload to the sandbox to obtain the predicted class from the classification model

Similarly, users can also deploy the model using MLflow. The advantage of using the MLflow format is its interoperability across platforms. It is compatible with Snorkel’s in-platform tools, such as inference APIs, prediction APIs, and the sandbox. It can also be served through third-party platforms like KServe, SageMaker, and Seldon Core. With the different endpoints available, users can host models in-house within Snorkel or utilize an HTTP port hosted on a Databricks workspace native to MLflow.

Figure 11. Deploying the use case application as an MLflow Package

How can Aimpoint Digital help add value using Snorkel?

With the domain and data science experts at Aimpoint Digital, we strive to help businesses reduce the need to label documents and extract text data manually. We are prepared to guide you in building innovative predictive AI applications by adapting large language and foundation models. Contact us through the form below to take advantage of Snorkel and learn about the different use cases we built in the NLP space.