import { ReferenceLink, InlineLink } from '../components/article/ReferenceBox';

function ImbalancedMachineLearning() {
  return (
    <div className="article-content">
        <p className="article-content">
            I've spent the last two months redesigning and rebuilding our machine learning pipeline. At work, we're trying to make it easier to estimate disease probability for rare diseases from well-formed, tabular patient phenotypic responses. We do this by building a classifier specific to each target disease.
        </p>
        <p className="article-content">
            Because we work with extremely rare diseases, it can be really hard to gather enough positive samples to represent an ideal dataset distribution. It's also expensive to validate data points, given the current cost of genetic testing, so the majority of our data points are unconfirmed (though likely negative).
        </p>
        <p className="article-content">
            In this article, I mainly discuss the challenges of training a binary classifier with an imbalanced, positive-unlabeled dataset.
        </p>
        <h2 className='article-content'>Data Cleaning</h2>
        <p className="article-content">
            My background is in software engineering and computer science, so the world of data science is somewhat foreign to me. Given that, my mind tends to focus on engineering questions. What I tend to consider:
        </p>
        <ul className="article-content">
            <li>How will we pull this data in a reproducible way?</li>
            <li>Where should we maintain training data associated with experiments?</li>
            <li>How should we surface metrics for training runs?</li>
        </ul>
        <p className="article-content">
            But when it comes to training an accurate model, these questions are less relevant. During my training iterations, I found that by far the biggest gains came from digging into the data and finding insights. What I should have considered:
        </p>
        <ul className="article-content">
            <li>Are there fake responses I need to filter out?</li>
            <li>How should I handle duplicates?</li>
            <li>Does a certain field need to be interpreted as a float?</li>
            <li>Should I fill the empty values in that column?</li>
            <li>Are there features I should remove from the input schema, or features I should add?</li>
        </ul>
        <p className="article-content">
            I spent a lot of time optimizing the platform, when I should have been paying attention to the data and the parameters.
        </p>
        <p className="article-content">
            You have to be extremely paranoid about the data. Learn it inside and out before you start building your classifier. As the adage goes, garbage in, garbage out: if your training set is inaccurate, your results will be too.
        </p>
        <h2 className='article-content'>Imbalanced, PU Datasets</h2>
        <p className="article-content">
            Imbalanced datasets can be finicky to work with. They are defined by having few samples of one class and a massive number of samples of the other. A ratio of 1000:1 or greater would be an extremely imbalanced dataset. Examples of imbalanced datasets in the real world:
        </p>
        <ul className="article-content">
            <li className="article-content">Fraudulent purchases in an online marketplace</li>
            <li className="article-content">Identifying a rare species of plant</li>
            <li className="article-content">Product manufacturing errors in a factory</li>
        </ul>
        <p className="article-content">
            In our case, we have few positive labels, and many negative (or unconfirmed) labels. Learning from positive and unlabeled data is called PU (positive-unlabeled) learning.
        </p>
        <p className="article-content">
            Having an imbalanced dataset can lead to a poorly performing model because the model might not have enough information to learn what the positive samples look like: every sample is weighted equally, so the few positives, which might themselves be somewhat divergent in presentation, contribute little to training. You can read more about that on <InlineLink reference={new ReferenceLink("this page about imbalanced datasets (Google)", "https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data")}/>.
        </p>
        <p className="article-content">
            There are two main techniques I want to highlight:
        </p>
        <ul className="article-content">
            <li className="article-content">Data Augmentation</li>
            <li className="article-content">Semi-Supervised Learning</li>
        </ul>
        <p className="article-content">
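            Whatever preprocessing you employ (cleaning, encoding, scaling), remember to apply the same transformations to the entire training set, and to the test set as well when running your predictions. Resampling and synthetic data generation are the exception: apply those to the training set only, so that your evaluation still reflects the real class distribution.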
        </p>
        <h3 className='article-content'>Data Augmentation</h3>
        <p className="article-content">
            The first approach to wrangling your imbalanced, PU dataset is to add more data, or to rebalance what you have. This can sound a bit shady, but it often works in practice. A simple approach is to upsample your positive data points, or downsample the negatives, depending on your particular needs and preferences.
        </p>
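        <p className="article-content">
            To make the upsampling idea concrete, here is a minimal sketch in plain Python. It is not our production code: the function name random_oversample and the toy data are hypothetical, and it assumes binary 0/1 labels with at least one positive row. It simply duplicates randomly chosen positives until both classes are the same size.
        </p>
        <pre className="article-content">
```python
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen positive rows until both classes are the same size."""
    rng = random.Random(seed)
    positives = [r for r, y in zip(rows, labels) if y == 1]
    negatives = [r for r, y in zip(rows, labels) if y == 0]
    # Number of extra positive copies needed to match the negatives.
    deficit = max(len(negatives) - len(positives), 0)
    extra = rng.choices(positives, k=deficit)
    balanced_rows = negatives + positives + extra
    balanced_labels = [0] * len(negatives) + [1] * (len(positives) + len(extra))
    return balanced_rows, balanced_labels


# Toy data (hypothetical): 2 positives followed by 8 negatives.
rows = [[i, i * 2] for i in range(10)]
labels = [1, 1] + [0] * 8
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(sum(balanced_labels), len(balanced_labels) - sum(balanced_labels))
```
        </pre>
        <p className="article-content">
            Downsampling works the same way in reverse: sample the negatives down to the number of positives instead of duplicating positives.
        </p>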
        <p className="article-content">
            Other approaches use methods of approximation to create similar but not equivalent datapoints to pad the dataset. For example, <InlineLink reference={new ReferenceLink("SMOTE (Synthetic Minority Oversampling Technique)", "https://arxiv.org/pdf/1106.1813.pdf")} /> is a common practical application of this methodology.
        </p>
        <p className="article-content">
            In this realm, I also made use of models that allowed me to scale up the weight of my positive samples during training. Some models let you pass weights to the training function or to the classifier itself, which scale the importance of particular samples in your dataset.
        </p>
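        <p className="article-content">
            As a rough illustration of sample weighting, the sketch below computes one weight per sample so that each class contributes the same total weight. The function name balanced_sample_weights is hypothetical; the formula (number of samples divided by the number of classes times the class count) is the common "balanced" heuristic, and the resulting list is the kind of value you would hand to a training function that accepts per-sample weights.
        </p>
        <pre className="article-content">
```python
from collections import Counter


def balanced_sample_weights(labels):
    """Weight each sample by n_samples / (n_classes * count of its class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return [n_samples / (n_classes * counts[y]) for y in labels]


# Toy labels (hypothetical): 2 positives and 98 negatives.
labels = [1] * 2 + [0] * 98
weights = balanced_sample_weights(labels)
# Each positive gets a large weight, each negative a small one.
print(weights[0], weights[-1])
```
        </pre>
        <p className="article-content">
            With these weights, the two positives together carry the same total weight as the 98 negatives.
        </p>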
        <h3 className='article-content'>Semi-Supervised Learning</h3>
        <p className="article-content">
            The second approach that might work well is semi-supervised learning. This is particularly relevant to the problem of the positive-unlabeled dataset, in which one might have many unlabeled points and few labeled ones. With this technique, we are able to make better use of the unlabeled data points, which might still provide value if we can infer their proximity to positive data points.
        </p>
        <p className="article-content">
            In the semi-supervised methodology, we first group similar data points together so that labels can spread between neighbors. We have three labels: 0, 1, and -1. -1 denotes unknown labels, while 1 is confirmed positive, and 0 is confirmed negative. We pass this partially labeled data to a neighbor-based model (for example, one using a KNN kernel), which then outputs new labels for us using a <InlineLink reference={new ReferenceLink("label propagation technique", "https://scikit-learn.org/stable/modules/generated/sklearn.semi_supervised.LabelPropagation.html")} />. I call these inferred labels.
        </p>
        <p className="article-content">
            Now, to train our new machine learning model, we can pass the inferred labels to it, rather than the actual labels. 
        </p>
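        <p className="article-content">
            The sketch below shows the general shape of producing inferred labels, using plain Python. It is a deliberately simplified stand-in, not the label propagation algorithm itself: instead of spreading labels iteratively over a similarity graph, it gives each unlabeled point (-1) the majority label of its k nearest confirmed points in a single pass. The function name infer_labels and the toy points are hypothetical.
        </p>
        <pre className="article-content">
```python
import math
from collections import Counter


def infer_labels(points, labels, k=3):
    """Give each unlabeled point (-1) the majority label of its k nearest labeled points."""
    labeled = [(p, y) for p, y in zip(points, labels) if y != -1]
    inferred = []
    for point, label in zip(points, labels):
        if label != -1:
            inferred.append(label)  # confirmed labels are kept as they are
            continue
        # Sort the confirmed points by Euclidean distance to this unlabeled point.
        nearest = sorted(labeled, key=lambda item: math.dist(point, item[0]))[:k]
        # Majority vote among the neighbors (use an odd k to avoid ties).
        votes = Counter(y for _, y in nearest)
        inferred.append(votes.most_common(1)[0][0])
    return inferred


# Toy data (hypothetical): three confirmed negatives near the origin,
# three confirmed positives near (5, 5), and two unlabeled points.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
          (0.15, 0.1), (5.05, 5.0)]
labels = [0, 0, 0, 1, 1, 1, -1, -1]
inferred_labels = infer_labels(points, labels)
print(inferred_labels)
```
        </pre>
        <p className="article-content">
            The inferred labels then replace the original label column when fitting the downstream classifier.
        </p>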
        <p className="article-content">
            Evaluating the veracity of the predictions is tricky for this type of technique. Because the ground truth isn't fully known, it's difficult to say for certain whether we should measure performance on the actual labels or the inferred labels. I've generally used the performance on the actual labels, as those represent our picture of reality.
        </p>
        <h2 className='article-content'>Evaluating Models</h2>
        <p className="article-content">
            Model evaluation in the case of imbalanced datasets can have some nuance. Generally, <InlineLink reference={new ReferenceLink("f-scores", "https://en.wikipedia.org/wiki/F-score")}/> are used to assess model performance; they combine precision and recall into a single performance metric. However, depending on your problem statement and mission, you might prefer either precision or recall. Is it more costly for you to miss a positive case, or to investigate a negative case? To account for that, you can pass a beta parameter to the function: a beta greater than 1 weights recall more heavily, while a beta less than 1 favors precision. An f-score function with a beta parameter looks like this:
        </p>
        <div className='article-content center'>
            <math className='article-content'>
                <mrow>
                    <msub>
                        <mi>F</mi>
                        <mi>&#946;</mi>
                    </msub>
                    <mo>=</mo>
                    <mfrac>
                        <mrow>
                            <mi>precision</mi>
                            <mo>&#x22C5;</mo>
                            <mi>recall</mi>
                        </mrow>
                        <mrow>
                            <mo>(</mo>
                            <msup>
                                <mi>&#946;</mi>
                                <mn>2</mn>
                            </msup>
                            <mo>&#x22C5;</mo>
                            <mi>precision</mi>
                            <mo>)</mo>
                            <mo>+</mo>
                            <mi>recall</mi>
                        </mrow>
                    </mfrac>
                    <mo>&#x22C5;</mo>
                    <mrow>
                        <mo>(</mo>
                        <mn>1</mn>
                        <mo>+</mo>
                        <msup>
                            <mi>&#946;</mi>
                            <mn>2</mn>
                        </msup>
                        <mo>)</mo>
                    </mrow>
                </mrow>
            </math>
        </div>
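        <p className="article-content">
            To see how beta shifts the score, here is a small sketch of the f-score with a beta parameter in plain Python. The function name f_beta and the precision and recall values are hypothetical, chosen only to illustrate a model whose recall is higher than its precision.
        </p>
        <pre className="article-content">
```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: (1 + beta^2) * precision * recall / (beta^2 * precision + recall)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero, so the score is defined as zero.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Hypothetical model with better recall than precision.
precision, recall = 0.5, 0.8
f1 = f_beta(precision, recall, beta=1.0)      # balanced
f2 = f_beta(precision, recall, beta=2.0)      # favors recall
f_half = f_beta(precision, recall, beta=0.5)  # favors precision
print(round(f1, 3), round(f2, 3), round(f_half, 3))
```
        </pre>
        <p className="article-content">
            Because this hypothetical model has higher recall than precision, its score rises as beta grows and falls as beta shrinks.
        </p>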
        <p className="article-content">
            Other evaluation metrics to look at are the AUC and the distribution of predicted probabilities. A highly accurate model might have an AUC close to 1 and a skewed probability distribution, with most negative data points scored close to 0. The Brier score, the mean squared difference between the predicted probabilities and the observed outcomes, can also help you evaluate how far off your predictions are.
        </p>
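        <p className="article-content">
            As a small worked example of the Brier score, the sketch below computes it in plain Python and compares it against a trivial baseline that always predicts the positive rate. The function name brier_score and the toy values are hypothetical. The comparison matters on imbalanced data, where even the trivial baseline gets a low (good-looking) score.
        </p>
        <pre className="article-content">
```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy data (hypothetical): 1 positive out of 10 samples.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.7, 0.1, 0.05, 0.2, 0.0, 0.1, 0.05, 0.0, 0.1, 0.3]

# Baseline: always predict the observed positive rate.
prevalence = sum(y_true) / len(y_true)
baseline_score = brier_score(y_true, [prevalence] * len(y_true))
model_score = brier_score(y_true, y_prob)
print(round(model_score, 4), round(baseline_score, 4))
```
        </pre>
        <p className="article-content">
            Lower is better, so a useful model should land clearly below the baseline rather than merely close to 0.
        </p>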
        <p className="article-content">
            If you use the semi-supervised learning technique on your dataset, evaluation can be trickier. It may not be clear whether to evaluate your performance on inferred labels or actual labels. In our case, it made sense to evaluate on the actual labels (where unlabeled is assumed negative), since the unlabeled data points are more likely to be negative than positive. For other datasets, one might prefer the inferred labels.
        </p>
        <h2 className='article-content'>DevOps</h2>
        <p className="article-content">
            In a highly data-intensive world that's changing constantly, it's important to keep your model in sync with the latest information. Without retraining, your models are liable to experience drift as real-world data deviates from your training set.
        </p>
        <p className="article-content">
            To achieve this, it's critical to be able to train models with the latest data, and then deploy models with minimal friction, so that we can iterate and improve on them over time. To that end, we must have a pipeline for continuous learning and deployment. As your pipeline collects more information, your models can get more accurate.
        </p>
        <p className="article-content">
            To track our experiment runs, we use a self-hosted instance of <InlineLink reference={new ReferenceLink("MLflow", "https://mlflow.org/")} /> running on ECS Fargate. There are many other tools out there, such as <InlineLink reference={new ReferenceLink("AWS SageMaker", "https://aws.amazon.com/sagemaker/")}/>, <InlineLink reference={new ReferenceLink("SeldonML", "https://www.seldon.io/")}/>, <InlineLink reference={new ReferenceLink("MLServer", "https://mlserver.readthedocs.io/en/stable/")}/>, and <InlineLink reference={new ReferenceLink("Managed MLflow", "https://www.databricks.com/product/managed-mlflow")}/> on Databricks. Using a more managed solution would make sense for a team that doesn't have the resources to manage its own services.
        </p>
        <h2 className='article-content'>Practical Implications</h2>
        <p className="article-content">
            Machine learning models are often used to inform the decisions we make out in the real world. We need resilient debiasing techniques to validate our experimentation, especially in an early learning phase. To that end, you should also sample and validate some candidates that score below your decision threshold, so that you can measure what the model is missing instead of only confirming what it already flags.
        </p>
        <p className="article-content">
            Your training data is everything. Before you even touch any code setting up any kind of classifier or any parameters, you need to clean your dataset and make sure the data is representative of reality. Human systems and structures are inundated with implicit bias.
        </p>
        <p className="article-content">
            For example, there was a <InlineLink reference={new ReferenceLink("study done on a Dutch program", "https://www.wired.com/story/welfare-algorithms-discrimination/")}/> that was used to decide whether or not to offer welfare benefits to people. In their dataset, there were far more women whose applications had been audited than men. By its very nature, a model trained solely on this data would treat the feature value of "female" as a likely indicator of welfare fraud, even if men and women commit fraud at similar rates.
        </p>
        <p className="article-content">
            All that to say, it's important to always be mindful of the real-world applications of the decision-making algorithms we deploy. We're at an inflection point, where machines are going to be making an ever-increasing volume of decisions on behalf of humans, and we need to be conscientious of how we navigate this next industrial revolution.
        </p>
    </div>
  );
}

export default ImbalancedMachineLearning;
