Reliable machine learning

An engineering perspective on ML software

Arseny Kravchenko

Senior ML Engineer
Worked at Instrumental ⚙️, WANNABY 👞, Gett 🚕, Yandex 🔎, Wargaming 🎮

Contact me:

Not that cool story

Two years ago I introduced a bug in a logger (!) that led to an outage on several assembly lines and a potential loss of $100k+.

Better testing practices could have prevented it.

What we will talk about

  • Non-ML specific tests
  • ML specific tests
  • Best practices for testing
  • Runtime checks
  • Software Engineer in Test mindset

What we will NOT talk about

  • ML model evaluation
  • ML metrics
  • Validation
  • Manual testing

ML and Software relationship

ML is built on top of software, so your projects should be tested from both perspectives.

Defensive programming

“The idea is based on defensive driving. In defensive driving, you adopt the mind-set that you're never sure what the other drivers are going to do. That way, you make sure that if they do something dangerous you won't be hurt. You take responsibility for protecting yourself even when it might be the other driver's fault. In defensive programming, the main idea is that if a routine is passed bad data, it won't be hurt, even if the bad data is another routine's fault.”

Steve McConnell. “Code Complete”

Reliable software requires (automatic) testing

Types of testing:

  • UI tests
  • Integration tests
  • End-to-end tests
  • Unit tests
  • Mutation tests
  • Many more

What are we avoiding with tests?

  • Explicit code errors (mostly for complicated logic)
  • Unintended changes in logic (e.g. during refactoring)

Reliable software requires different kinds of tests

Write tests with different granularity.
The more high-level you get, the fewer tests you should have.


Example: microservice for oriented object detection

POST /detect

@app.route('/detect', methods=['POST'])
def process_image(request):
    image = storage.get_image(request.get('image_path'))
    coords, angle = detector(image)
    cropped = crop_image(image, coords)
    rotated = rotate_image(cropped, angle)
    result_path = storage.save_image(rotated)
    return {"success": True, "result": result_path}

Example: microservice for oriented object detection

class ImageStorage:
    @abstractmethod
    def get_image(self, path: str) -> np.ndarray:
        pass

    @abstractmethod
    def save_image(self, image: np.ndarray) -> str:
        pass

Example: microservice for oriented object detection

class Detector:
    @abstractmethod
    def __call__(self, img: np.ndarray) -> Tuple[Tuple[int, int, int, int], float]:
        """
        Returns object coords (x, y, w, h) and angle in degrees.
        """
        pass

Example: microservice for oriented object detection

def crop_image(img: np.ndarray, coords: Tuple[int, int, int, int]) -> np.ndarray:
    pass

def rotate_image(img: np.ndarray, angle: float) -> np.ndarray:
    pass

Unit tests

Fast, simple, isolated tests for a single method/function.
Required tools: pytest

Unit tests

def test_crop_image():
    # generate an image
    img = np.zeros((100, 100, 3))
    img[10:20, 10:20, :] = 1
    
    cropped = crop_image(img, (10, 10, 10, 10))
    assert cropped.mean() == 1

Unit tests

Should be as small and atomic as possible

What's wrong with this example?

def test_detect_rotated_image():
    img = cv2.imread('fixtures/car.jpg')

    for angle in (30, 45):
        rotated = rotate_image(img, angle=angle)
        detector = Detector()
        coords, detected_angle = detector(rotated)
        assert detected_angle == angle

Unit tests

Should be as small and atomic as possible

See @parametrize in pytest

@pytest.mark.parametrize('angle', [30, 45])
def test_detect_rotated_image(angle):
    img = cv2.imread('fixtures/car.jpg')

    rotated = rotate_image(img, angle=angle)
    detector = Detector()
    coords, detected_angle = detector(rotated)
    assert detected_angle == angle

Fixtures

Fixtures are fixed inputs and expected outputs used in tests

def test_crop_image_with_fixture():
    img = cv2.imread('fixtures/full_img.jpg')
    cropped = crop_image(img, (10, 10, 10, 10))
    # np.testing.assert_equal raises on mismatch, so no extra assert is needed
    np.testing.assert_equal(cropped, cv2.imread('fixtures/cropped_img.jpg'))

Integration tests

Slower, affect several components, require more maintenance work
Required tools: pytest (may require unittest for mocks, patches etc.)

Integration tests

def test_save_image(client):
    request_data = {'image_path': '/path/to/image.jpg'}
    response = client.post('/detect', request_data)
    assert response['success'] is True
    assert response['result'].endswith('.jpg')

Mocks and patches

from unittest import mock

def get_angle():
    return 42

new_mock = mock.Mock(return_value=0)
with mock.patch('__main__.get_angle', new_mock):
    print(get_angle())
    print(new_mock.call_count)

Mocks and patches

In [1]: from unittest import mock
   ...:
   ...: def get_angle():
   ...:     return 42
   ...:
   ...: new_mock = mock.Mock(return_value=0)
   ...: with mock.patch('__main__.get_angle', new_mock):
   ...:     print(get_angle())
   ...:     print(new_mock.call_count)
   ...:
0
1

Mocks and patches

Useful for integration-like tests; for example, one can mock (see the sketch below):

  • an external service;
  • a long-running function (e.g. a deep learning model!);
  • messy IO operations
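
A minimal sketch of such a test, assuming the microservice from the earlier slides lives in a module named app (the module path, the client fixture, and the mocked return value are illustrative assumptions):

from unittest import mock

def test_detect_endpoint_with_mocked_detector(client):
    # replace the heavy deep learning detector with a fast, deterministic stub
    fake_detector = mock.Mock(return_value=((10, 10, 50, 50), 30.0))
    with mock.patch('app.detector', fake_detector):
        response = client.post('/detect', {'image_path': '/path/to/image.jpg'})

    assert response['success'] is True
    assert fake_detector.call_count == 1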

More complicated pytest stuff

You can check that an exception is raised and even make sure the proper logs are written.

See 5 Advanced Pytest Tricks

More complicated pytest stuff

def test_rotate_with_incorrect_parameters():
    with pytest.raises(RotationParametersError):
        rotate_image(image, angle='OMG NOT A NUMBER')

See negative tests later in slides.

More complicated pytest stuff

def test_some_logs(caplog):
    do_magic()
    pattern = 'Doing magic'
    logs = [x for x in caplog.messages if pattern in x]
    assert logs == ['Doing magic for the greater good!']

Warning: most likely, the desire to build such a test means the software design is far from perfect.

I have a legacy code and no tests, what do I do?

  • Don’t panic 🧘
  • Gradually improve the codebase
  • Start with higher-level smoke tests and add new, lower-level tests
  • Always run your tests in CI. Tests that are not runnable on a regular basis tend to rot even faster than the main codebase.

Smoke tests

Just scratch the surface with very high-level integration tests.
For scripting languages like Python, it can be a sanity check comparable to “it compiles!”.
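
A minimal smoke test sketch, assuming the service code lives in a hypothetical package named app: simply importing the modules already catches syntax errors and broken imports.

def test_smoke_imports():
    # for Python, a successful import is roughly "it compiles!"
    import app
    import app.detector
    import app.storage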

Exotics: mutation tests

“Who will watch the watcher?”
Change the codebase (“mutate”) in a random way (e.g. replace + with -)
Run your tests
If tests are still ✅, it may be an indicator that tests are not great

Mutation Testing with Python
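
A conceptual sketch of what a mutation testing tool does (the functions here are purely illustrative):

def total_angle(a, b):
    return a + b      # original code

def total_angle_mutant(a, b):
    return a - b      # mutated code: + replaced with -

# This weak test passes for both the original and the mutant (b == 0 hides the
# change) — exactly the kind of gap mutation testing is designed to expose.
def test_total_angle_weak():
    assert total_angle(42, 0) == 42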

Exotics: property-based tests

Instead of testing particular values, let’s test the behavior (function properties).

Tests with hardcoded values => tests with randomized inputs => property-based tests.

Hypothesis is a Python library for property-based testing

📹 Property-based testing (talk in Russian)
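
A sketch of a property-based test with Hypothesis, assuming the rotate_image helper from the earlier slides: instead of hardcoding angles, we let Hypothesis generate them and check properties that should hold for any angle.

import numpy as np
from hypothesis import given, strategies as st

@given(st.floats(min_value=-180, max_value=180, allow_nan=False))
def test_rotate_preserves_dtype_and_black_image(angle):
    img = np.zeros((64, 64, 3), dtype=np.uint8)
    rotated = rotate_image(img, angle=angle)
    # properties: dtype is preserved, and an all-black image stays all-black
    assert rotated.dtype == img.dtype
    assert rotated.sum() == 0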

Best practices: coverage

Coverage is a metric reflecting what % of the codebase (usually line-wise) is exercised when the full test suite is run.

test coverage = (number of lines covered by tests) / (total number of lines of code)

While coverage seems like a good metric to start with, it's not perfect: one can easily build a test that covers the full codebase but doesn't actually check the result.
Thus coverage is useful as an informative metric, but not that useful for automatically deciding whether a code change is tested well enough.

Best practices: add tests when bug happens

New tests should be introduced after a defect has been found.
Each new production defect highlights a gap in the existing tests, so we add a test to verify that the defect is fixed and will not happen again.

Best practices: CI

If your tests are not constantly rerun (usually on CI like GitHub Actions, CircleCI, etc.), they rot quickly as the code changes.

Positive and negative tests

While positive tests check that everything works in the optimistic scenario, negative tests check various (mostly graceful) expected failures.
The more experienced the engineer, the more negative tests they usually write.
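
A short sketch of a positive/negative pair for the rotate_image helper from the earlier slides (RotationParametersError is the exception used in a previous example):

import numpy as np
import pytest

def test_rotate_image_positive():
    # optimistic scenario: valid input rotates without errors
    img = np.zeros((10, 10, 3), dtype=np.uint8)
    assert rotate_image(img, angle=90.0).shape[2] == 3

def test_rotate_image_negative():
    # dirty scenario: bad input fails loudly and predictably
    img = np.zeros((10, 10, 3), dtype=np.uint8)
    with pytest.raises(RotationParametersError):
        rotate_image(img, angle='not a number')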

Positive and negative tests

“Developer tests tend to be "clean tests". Developers tend to test for whether the code works (clean tests) rather than test for all the ways the code breaks (dirty tests). Immature testing organizations tend to have about five clean tests for every dirty test. Mature testing organizations tend to have five dirty tests for every clean test. This ratio is not reversed by reducing the clean tests; it's done by creating 25 times as many dirty tests (Boris Beizer in Johnson 1994).”

Steve McConnell. “Code Complete”.

Flaky tests

Sometimes tests fail or pass from run to run without any code changes. These tests are called flaky.
Popular reasons:

  • Using randomness without a fixed seed;
  • Race conditions;
  • Incorrect design of IO operations with temporary data

Flaky tests

def test_randomly_cropped_image_with_fixture():
    img = cv2.imread('fixtures/big_img.jpg')
    detector = Detector()
    crop = random_crop(img, size=(224, 224))
    coords, angle = detector(crop)
    assert angle == 42    

Flaky tests

@pytest.mark.parametrize('angle', [0, 42])
def test_detect_image_from_file(angle: int):
    img = get_image_with_rotated_object(angle)
    image_path = '/tmp/image.png'
    cv2.imwrite(image_path, img)
    request = {'image_path': image_path}
    coords, detected_angle = detect_from_path(request)
    assert detected_angle == angle
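
A sketch of a less flaky variant of the test above, under the same assumed helpers: fix the random seed and use pytest's tmp_path fixture instead of a shared hardcoded /tmp path.

@pytest.mark.parametrize('angle', [0, 42])
def test_detect_image_from_file_stable(angle: int, tmp_path):
    np.random.seed(0)  # helps only if the helpers rely on numpy randomness
    img = get_image_with_rotated_object(angle)
    image_path = str(tmp_path / 'image.png')  # unique temporary directory per test
    cv2.imwrite(image_path, img)
    coords, detected_angle = detect_from_path({'image_path': image_path})
    assert detected_angle == angle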

ML specific software quality

In "regular" software, bugs tend to be more explicit.
In the ML world, defects are easier to hide.

Is there a bug?

def extract_angle_from_warp_matrix(rotation_matrix: np.ndarray) -> float:
    # Extracts rotation angle (in degrees) from warp matrix
    return np.rad2deg(np.arctan2(-rotation_matrix[1, 0], rotation_matrix[1, 1]))

ML specific common bugs

  • Different preprocessing or feature engineering for train/test or research/production components;
  • Numerical problems (e.g. overflow or nan after division by zero);
  • Incorrect axis and unexpected broadcasting;
  • Image processing: mixing up x and y (or width and height);
  • Mismatch between libraries.

See also: 8 Deep Learning / Computer Vision Bugs And How I Could Have Avoided Them

Where is the bug?

from skimage.transform import AffineTransform
from skimage.data import astronaut

import cv2 
import numpy as np

image = astronaut()[:300, :, :]
transform = AffineTransform(rotation=np.deg2rad(45))

width, height, channels = image.shape
aligned = cv2.warpPerspective(image, transform.params, dsize=(width, height))

assert aligned.shape == image.shape, f"{aligned.shape} != {image.shape}"

Where is the bug?

from skimage.transform import AffineTransform
from skimage.data import astronaut

import cv2 
import numpy as np

image = astronaut()[:300, :, :]
transform = AffineTransform(rotation=np.deg2rad(45))

width, height, channels = image.shape
aligned = cv2.warpPerspective(image, transform.params, dsize=(width, height))

assert aligned.shape == image.shape, f"{aligned.shape} != {image.shape}"

AssertionError: (512, 300, 3) != (300, 512, 3)

ML specific: consistency tests

In data preprocessing, we often use functions to convert things back and forth.

E.g. np.rad2deg and np.deg2rad.

Thus we may require consistency like
x = inverse_fn(fn(x))
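
A sketch of a consistency (round-trip) test for this pair of functions:

import numpy as np
import pytest

@pytest.mark.parametrize('angle', [0.0, 30.0, 42.5, 180.0])
def test_deg_rad_round_trip(angle):
    # converting back and forth should return (approximately) the original value
    assert np.rad2deg(np.deg2rad(angle)) == pytest.approx(angle)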

ML specific: invariance tests

One often needs a function to be invariant to input changes. E.g. we expect our detector to keep working after we apply gamma correction to the image.
f(x) ≈ f(aug(x)) (up to some eps)
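
A sketch of such a test, assuming the Detector from the earlier slides; the detector and image_fixture fixtures and the eps value are illustrative assumptions.

def test_detector_invariant_to_gamma(detector, image_fixture):
    img = cv2.imread(image_fixture)
    # apply a mild gamma correction as the augmentation
    gamma_corrected = np.clip(((img / 255.0) ** 0.8) * 255.0, 0, 255).astype(np.uint8)

    _, angle = detector(img)
    _, angle_aug = detector(gamma_corrected)

    assert abs(angle - angle_aug) < 1.0  # eps chosen for illustration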

ML specific: invariance tests

For an image recognition model, we might expect the model to be invariant to:

  • image rotation,
  • partial occlusion,
  • perspective shift,
  • lighting conditions,
  • weather artifacts (rain, snow, fog),
  • camera artifacts (ISO noise, motion blur),
  • ...

ML specific: invariance tests

For a sentiment analysis model on the following two sentences:

  • Mark was a great instructor.
  • Samantha was a great instructor.

We would expect that simply changing the name of the subject doesn't affect the model predictions.

ML specific: negation tests

We expect some changes to invert the result.

  • “I like the show” => positive sentiment
  • “I don’t like the show” => negative sentiment.

Thus we can artificially add negation to a subset of samples and make sure the labels are inverted as well.
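
A minimal sketch, assuming a hypothetical sentiment_model fixture with a predict method that returns a label:

def test_negation_flips_sentiment(sentiment_model):
    # adding a negation should invert the predicted label
    assert sentiment_model.predict("I like the show") == "positive"
    assert sentiment_model.predict("I don't like the show") == "negative"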

ML specific: compare your environments

For example, your model runs on mobile and in the cloud using different engines: coreml and onnxruntime.

It would be great to make sure both pipelines produce equal results:

def test_pipelines_match(image_fixture):
    image = cv2.imread(image_fixture)
    coreml_model = CoreMLModel()
    onnx_model = OnnxModel()
    coreml_bbox, coreml_angle = coreml_model.predict(image)
    onnx_bbox, onnx_angle = onnx_model.predict(image)

    np.testing.assert_allclose(coreml_bbox, onnx_bbox, rtol=0, atol=1e-4)
    np.testing.assert_allclose(coreml_angle, onnx_angle, rtol=0, atol=1e-4)

ML specific: unit tests with fixtures

We may want an exact or approximately equal result in different scenarios.

from sklearn.datasets import load_iris

def test_exact_values():
    data = load_iris()
    model = train_model(data)

    # probability of the first class for the first sample
    score = model.predict_proba(data.data[:1])[0, 0]
    assert score == 0.42

It validates that we didn't change train_model unintentionally.

ML specific: directional expectation tests

Allow us to define a set of perturbations to the input which should have a predictable effect on the model output.

For example, if we had a housing price prediction model, we might assert (sketched after this list):

  • Increasing the number of bathrooms (holding all other features constant) should not cause a drop in price.
  • Lowering the square footage of the house (holding all other features constant) should not cause an increase in price.
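
A sketch of such directional tests; the price_model and base_house fixtures and the feature names are illustrative assumptions.

def test_price_directional_expectations(price_model, base_house):
    # perturb one feature at a time, holding the others constant
    more_bathrooms = {**base_house, 'bathrooms': base_house['bathrooms'] + 1}
    smaller_house = {**base_house, 'sqft': base_house['sqft'] * 0.8}

    base_price = price_model.predict(base_house)
    assert price_model.predict(more_bathrooms) >= base_price
    assert price_model.predict(smaller_house) <= base_price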

ML specific: shallow network as fixture

Instead of using a big network, one can have a dummy replacement with similar properties (e.g. input and output dimensions) and use it in integration tests.

E.g. this dummy Resnet50 backbone takes a 224x224x3 image as input and returns a 1x2048 feature vector as output.

class DummyResnet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Sequential(torch.nn.AdaptiveAvgPool2d(1),
                                         torch.nn.Conv2d(3, 2048, 1))

    def forward(self, x):
        return self.model(x).squeeze(-1).squeeze(-1)
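
A quick sanity check sketch that pins the shape contract of the dummy backbone:

import torch

def test_dummy_resnet_output_shape():
    dummy = DummyResnet()
    features = dummy(torch.zeros(1, 3, 224, 224))
    assert features.shape == (1, 2048)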

ML specific: test end-to-end pipeline

import yaml

from segmentation import train

def test_segmentation_end2end(some_params):
    # this test doesn't check values much; it checks that the pipeline runs smoothly on a micro batch
    with open(get_relative_path(__file__, '../config.yaml')) as fd:
        config = yaml.load(fd, Loader=yaml.SafeLoader)

    config['datasets_base_dir'] = get_relative_path(__file__, 'fixtures/')
    config['max_train_samples'] = 2
    config['max_val_samples'] = 1
    config['epochs'] = 5

    result = train(config)
    assert result['train_loss'] < .5
    assert result['val_loss'] < 1

Samples as tests

See Andrej Karpathy’s talk about using specific hard samples as tests for your ML models (e.g. make sure your detector doesn't miss a partially occluded object).

Assertions and runtime checks

While assertions are great for testing, one should prefer conditional checks over assertions at runtime.

Why?

  • Assertions are stripped under optimization (python -O script.py) and by some exotic interpreters
  • Explicit checks allow verbose error messages
  • The Zen of Python: explicit is better than implicit

Assertions and runtime checks

if input_data.shape != (100, 100, 3):
    logger.error(f"Got data shaped {input_data.shape} as input, expected (100, 100, 3)")
    raise SpecificRuntimeError("Some proper message")

or

assert input_data.shape == (100, 100, 3) 

Assertions and runtime checks

assert 1 == 0
print("OK!")
➜  /tmp python -O check.py
OK!
➜  /tmp python check.py
Traceback (most recent call last):
  File "check.py", line 1, in <module>
    assert 1 == 0
AssertionError

What should we check in runtime?

  • Input data format (images: number of channels, BCHW vs BHWC vs HWC …)
  • Input data type (e.g. float32 vs uint8)
  • Input data range (e.g. 0..255 or 0..1 for images)
  • Any other domain-specific constants and invariants (see the sketch below)
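
A sketch of such runtime input validation; the exact constraints and the error type are illustrative assumptions.

import numpy as np

def validate_image(image: np.ndarray) -> None:
    # format: HWC with 3 channels
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError(f"Expected an HWC image with 3 channels, got shape {image.shape}")
    # dtype: uint8 (which also implies the 0..255 value range)
    if image.dtype != np.uint8:
        raise ValueError(f"Expected a uint8 image, got {image.dtype}")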

What should we check in runtime?

  • Some property-like checks: e.g. an image rotation function may check that the rotation matrix determinant equals 1
def rotate_image(image, rotation_matrix):
    det = np.linalg.det(rotation_matrix)
    if np.abs(det - 1) > 1e-6:
        raise ValueError(f'Rotation matrix {rotation_matrix.tolist()} is scaled')
    ...

Various languages require various levels of runtime checks

Strong static typing catches many errors that can be missed in a dynamically typed environment.

Python type hinting can slightly reduce the need for runtime checks.

Side effect: improves code understanding for other engineers.

Various languages require various levels of runtime checks

➜  /tmp cat type_error.py
from typing import List

def sum_odds(numbers: List[int]):
    return sum([x for x in numbers if x % 2])

print(sum_odds([1, 2., 3, 4]))
➜  /tmp python type_error.py
4
➜  /tmp mypy type_error.py
type_error.py:6: error: List item 1 has incompatible type "float"; expected "int"
Found 1 error in 1 file (checked 1 source file)

Robustness vs correctness

“As the video game and x-ray examples show us, the style of error processing that is most appropriate depends on the kind of software the error occurs in. These examples also illustrate that error processing generally favors more correctness or more robustness. Developers tend to use these terms informally, but, strictly speaking, these terms are at opposite ends of the scale from each other. Correctness means never returning an inaccurate result; returning no result is better than returning an inaccurate result. Robustness means always trying to do something that will allow the software to keep operating, even if that leads to results that are inaccurate sometimes.”

Steve McConnell. “Code Complete”.

Robustness vs correctness

“Safety-critical applications tend to favor correctness to robustness. It is better to return no result than to return a wrong result. The radiation machine is a good example of this principle.
Consumer applications tend to favor robustness to correctness. Any result whatsoever is usually better than the software shutting down. The word processor I'm using occasionally displays a fraction of a line of text at the bottom of the screen. If it detects that condition, do I want the word processor to shut down?”

Steve McConnell. “Code Complete”.

Debugging

Your code will inevitably be defective sometimes and will require maintenance. So be ready for debugging and defect investigation.

Debugging: mental exercise

Imagine you're training a deep network on a big dataset. The training takes 2 hours per epoch.
Every run fails with Out Of Memory after 40–60 epochs.

How do you solve it?

Debugging: way to go

  1. Gather related data (logs, problematic samples...)
  2. Try to reproduce the problem locally
  3. Do a binary search to localize the problem
  4. Iterate over hypotheses about what could go wrong

Logging

Logging is one part of a monitoring strategy. Good monitoring enables you to:

  1. Be alerted when things break
  2. Learn what's broken and why
  3. Inspect trends over long time frames
  4. Compare system behavior across different versions and/or experimental groups (e.g. AB testing)

Logging

  • If you're not sure whether you should log or not, log

You can always remove logs that you later realize are unnecessary, but you can't retroactively add them back.

  • Don't log if it doesn't add new context

So try to inject specific details into the log message.

❌ logger.info("Rotated the image")
✅ logger.info(f"Rotated the image loaded from {image_path} for angle {angle}")

Monitoring

In regular software engineering, one tends to monitor that the software is working: no errors, good response times, etc.

In ML engineering, one should also monitor the quality of the models and pipelines.

Monitoring

Regular software (say, a CRM system) rarely breaks without code changes or significant input data changes.

ML software can be really sensitive to minor distribution changes (seasonality, trends, new cameras and microphones for visual/audio data...)

Monitoring

Two conclusions:

  1. ML engineers should monitor input/output data distributions, model confidence, etc. (a simple sketch follows below).
  2. ML engineers should prefer robust (strongly regularized) models to overfitted ones when possible.

See also: Best Practices for Dealing with Concept Drift
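
A minimal drift-monitoring sketch for point 1; the feature statistics, the threshold, and the logger setup are illustrative assumptions.

import numpy as np

def check_feature_drift(live_values: np.ndarray, train_mean: float,
                        train_std: float, n_sigmas: float = 3.0) -> bool:
    # compare the live feature mean against training-time statistics
    drifted = abs(live_values.mean() - train_mean) > n_sigmas * train_std
    if drifted:
        logger.warning(f"Feature drift detected: live mean {live_values.mean():.3f} "
                       f"vs train mean {train_mean:.3f}")
    return drifted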

Links for further reading (generic software)

📃 The Practical Test Pyramid
📃 The other side of the coin, or on the drawbacks of unit testing (article in Russian)
📚 Steve McConnell, “Code Complete”

Links for further reading (ML specific)

📃 PyTest for Machine Learning — a simple example-based tutorial
📃 Effective testing for machine learning systems
📃 Machine Learning Testing: Survey, Landscapes and Horizons
📹 Unit Testing for Data Scientists
📃 Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
📃 TensorFuzz: Debugging Neural Networks with Coverage-Guided Fuzzing

Thanks!

Shameless self-promo

📚 Valerii Babushkin and I are writing the book "Principles of ML System Design".
More info: https://arseny.info/ml_design_book