YOLO Discrepancy: Python vs. iOS CoreML Results

Hey guys! Have you ever run into a situation where your machine learning model performs flawlessly in Python but throws a curveball when deployed on iOS using CoreML? Today, we're diving deep into a peculiar issue reported by a user who experienced different results between Python and iOS when using Ultralytics YOLO. Let's break down the problem, explore the steps taken, and discuss potential reasons for this discrepancy.

The Curious Case of Divergent Predictions

The user trained a model using Ultralytics YOLO, exported it to CoreML format (.mlpackage), and initially found that the inference results from Python, whether using the original (.pt) model or the exported (.mlpackage) model, were perfectly aligned. This is the expected behavior, and it's a great first step in ensuring your model is consistent. However, the plot thickened when the same .mlpackage model was deployed on iOS using the official yolo-ios-app example. The results? Completely different. Uh oh!

Here's a quick summary of the situation:

  • Python: Both best.pt and best.mlpackage produce identical predictions.
  • iOS (yolo-ios-app): The same .mlpackage yields significantly different results.
  • 🖼️ The Setup: The tests were conducted using the same image from the photo library to ensure a fair comparison.

Replicating the Experiment: Steps Taken

To better understand the issue, the user meticulously outlined the steps they took. Let's walk through them:

1️⃣ Training the YOLO Model

The first step, as with any machine learning project, was training the YOLO model. The user employed the following Python code snippet:

from ultralytics import YOLO

model = YOLO("yolo11n.yaml")  # or yolo11n.pt
results = model.train(
    data="datasets/version12/data.yaml",
    epochs=50,
    imgsz=640,
    batch=-1,
    workers=1,
    perspective=0.01,
    translate=0,
    scale=0,
    mosaic=0,
    erasing=0,
    pretrained=True,
    rect=True,
)

In this stage, the user leveraged Ultralytics YOLO, a popular object detection framework, initializing the model from yolo11n.yaml (or yolo11n.pt) and training it for 50 epochs on a custom dataset described in data.yaml. Key parameters include imgsz=640 (input image size), batch=-1 (automatic batch sizing), and rect=True (rectangular training batches). Most augmentations (translate, scale, mosaic, erasing) were disabled, with only a small perspective augmentation retained. Setting pretrained=True leverages transfer learning, which typically speeds up training and improves model performance.

2️⃣ Exporting to CoreML

Once the model was trained, the next step was to export it to CoreML format, which is essential for running the model efficiently on Apple devices. The user employed the following code:

from ultralytics import YOLO

def export_and_zip_yolo_models():
    imgsz = 640
    model_name = "detect/train8/weights/best"
    model = YOLO(f"{model_name}.pt")
    # Export to CoreML; nms=False leaves NMS to be applied by the consuming app
    model.export(format="coreml", imgsz=imgsz, conf=0.4, iou=0.3, nms=False)

export_and_zip_yolo_models()

Here, the trained model (best.pt) is loaded, and the export function is used to convert it to CoreML format (.mlpackage). Important parameters like imgsz (image size), conf (confidence threshold), and iou (Intersection over Union threshold) are specified. Notably, nms=False indicates that Non-Maximum Suppression (NMS) is disabled during the export. This is a crucial detail, as NMS is a post-processing step that eliminates redundant bounding boxes, and differences in its implementation can lead to discrepancies in results.

3️⃣ Python Validation: .pt vs .mlpackage

To ensure the CoreML export was successful and didn't introduce any issues, the user compared the predictions from the original .pt model and the exported .mlpackage model within Python. This is a vital step in the debugging process, and they reported that the results were identical.

import torch
from ultralytics import YOLO

device = 'cuda' if torch.cuda.is_available() else \
         'mps' if torch.backends.mps.is_available() else 'cpu'

# Path to the same test image later used on iOS (placeholder)
source = "test_image.jpg"

model_pt = YOLO("detect/train8/weights/best.pt")
model_ml = YOLO("detect/train8/weights/best.mlpackage")

results_ml = model_ml.predict(
    source,
    imgsz=640,
    conf=0.4,
    device=device,
    agnostic_nms=True,
    iou=0.3,
)

results_pt = model_pt.predict(
    source,
    imgsz=640,
    conf=0.4,
    device=device,
    agnostic_nms=True,
    iou=0.3,
)

# Both results are identical (same boxes/classes/scores)

This code snippet demonstrates loading both the .pt and .mlpackage models and using them to make predictions on the same input (source). The results are then compared, and the user confirmed that the bounding boxes, class labels, and confidence scores were the same. This strongly suggests that the CoreML export process itself was not the source of the problem.
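To go beyond eyeballing the printed output, the two result objects can also be compared numerically. Here is a minimal sketch, assuming the results_pt and results_ml variables from the snippet above and that both models return their detections in the same order (a small tolerance absorbs floating-point noise):

import numpy as np

boxes_pt = results_pt[0].boxes
boxes_ml = results_ml[0].boxes

# Same number of detections and same predicted classes
assert len(boxes_pt) == len(boxes_ml)
assert (boxes_pt.cls.cpu().numpy() == boxes_ml.cls.cpu().numpy()).all()

# Boxes (pixels) and confidence scores should agree within a small tolerance
np.testing.assert_allclose(boxes_pt.xyxy.cpu().numpy(), boxes_ml.xyxy.cpu().numpy(), atol=1.0)
np.testing.assert_allclose(boxes_pt.conf.cpu().numpy(), boxes_ml.conf.cpu().numpy(), atol=0.01)
print("PyTorch and CoreML predictions match in Python")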

4️⃣ iOS Implementation: The Discrepancy Emerges

The final step was to run the same .mlpackage model on iOS using the yolo-ios-app example. This is where the unexpected behavior surfaced. The predictions on iOS were significantly different from those obtained in Python, even though the same input image, image size, confidence threshold, and IoU threshold were used.

The Swift code snippet used for inference on iOS is as follows:

struct ContentView: View {
    @State private var selectedItem: PhotosPickerItem?
    @State private var inputImage: UIImage?
    @State private var yoloResult: YOLOResult?

    let yolo = YOLO("best", task: .detect) // best.mlpackage

    var body: some View {
        VStack {
            if let annotated = yoloResult?.annotatedImage {
                Image(uiImage: annotated).resizable().scaledToFit()
            } else if let input = inputImage {
                Image(uiImage: input).resizable().scaledToFit()
            } else {
                Text("No image selected")
            }

            PhotosPicker(selection: $selectedItem, matching: .images) {
                Text("Pick Photo")
                    .padding().foregroundColor(.white)
                    .background(Color.blue).cornerRadius(8)
            }
            .onChange(of: selectedItem) { newItem in
                Task {
                    guard let newItem = newItem,
                          let data = try? await newItem.loadTransferable(type: Data.self),
                          let uiImage = UIImage(data: data) else { return }

                    let correct = getCorrectOrientationUIImage(uiImage: uiImage)
                    inputImage = correct
                    yoloResult = yolo(correct) // Results differ from Python
                    print(yoloResult as Any)
                }
            }
        }
        .padding()
    }
}

func getCorrectOrientationUIImage(uiImage: UIImage) -> UIImage {
    let ciContext = CIContext()
    // Re-render the pixels so the stored orientation matches what the model sees
    switch uiImage.imageOrientation {
    case .down:
        guard let oriented = CIImage(image: uiImage)?.oriented(.down),
              let cg = ciContext.createCGImage(oriented, from: oriented.extent) else { return uiImage }
        return UIImage(cgImage: cg)
    case .right:
        guard let oriented = CIImage(image: uiImage)?.oriented(.right),
              let cg = ciContext.createCGImage(oriented, from: oriented.extent) else { return uiImage }
        return UIImage(cgImage: cg)
    default:
        return uiImage
    }
}

The code uses SwiftUI to create a simple app that allows the user to select an image from their photo library and run it through the YOLO model. The YOLO class is initialized with the best.mlpackage model. The onChange(of: selectedItem) modifier triggers the inference process when a new image is selected. The getCorrectOrientationUIImage function ensures that the image orientation is handled correctly, which is a common pitfall in iOS image processing. The yolo(correct) call performs the object detection, and the results are stored in the yoloResult variable.

The Heart of the Matter: Discrepancies in Results

The user reported that the bounding boxes, class labels, and confidence scores obtained on iOS were significantly different from those in Python. This is a major issue, as it undermines the reliability of the model when deployed on a mobile platform. The fact that the same image, image size, confidence threshold, and IoU threshold were used in both environments makes the discrepancy even more perplexing.

Unraveling the Mystery: Potential Causes

So, what could be causing these divergent results? Let's brainstorm some potential culprits:

1. Non-Maximum Suppression (NMS) Implementation

As mentioned earlier, NMS is a crucial post-processing step in object detection: it eliminates overlapping bounding boxes, keeping only the ones with the highest confidence scores. The user explicitly disabled NMS during the CoreML export (nms=False), which means NMS has to be applied by whatever consumes the model's raw outputs. That suggests NMS may be handled differently in the yolo-ios-app than in the Python Ultralytics YOLO pipeline. CoreML itself provides NMS operations, and the YOLO iOS app may be using those; if the parameters or logic differ from Python's, the final results will differ as well.

To investigate this, you could try these steps:

  • Enable NMS during CoreML export: Re-export the model with nms=True (see the sketch after this list) and check whether the iOS results align better with Python. With NMS baked into the model, CoreML handles it rather than the app.
  • Inspect NMS parameters in iOS: Examine the yolo-ios-app code to see how NMS is being applied. Are the confidence and IoU thresholds the same as in Python?
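If you want to try the first option, re-exporting with NMS embedded is a small change to the export script shown earlier. A minimal sketch, reusing the same weights path and the thresholds from the original export:

from ultralytics import YOLO

# Bake NMS into the CoreML model so iOS does not have to run its own
# (possibly different) NMS implementation on the raw outputs.
model = YOLO("detect/train8/weights/best.pt")
model.export(format="coreml", imgsz=640, nms=True, conf=0.4, iou=0.3)

If the iOS predictions line up with Python after this change, the discrepancy most likely lives in the app-side NMS parameters or logic.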

2. Image Preprocessing Differences

Even seemingly minor differences in image preprocessing can impact the results of a deep learning model. This includes things like:

  • Image resizing and scaling: How are the images resized to the target imgsz (640 in this case) in Python and iOS? Are the aspect ratios preserved? Are the images cropped or padded? Different resizing algorithms can lead to subtle variations in the input to the model.
  • Pixel value normalization: Are the pixel values normalized in the same way in both environments? For example, are they scaled to the range [0, 1] or [-1, 1]? Any deviation here can affect the model's output.
  • Color space conversion: Is there any color space conversion happening (e.g., RGB to BGR)? If so, are the conversions identical?

To troubleshoot this, you could:

  • Log image preprocessing steps: Add logging to both the Python and iOS code to print out the image dimensions and pixel value ranges after each preprocessing step.
  • Visualize preprocessed images: Save the images after preprocessing in both environments and visually compare them to identify any discrepancies; a Python-side sketch follows this list.
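On the Python side, Ultralytics handles resizing and normalization internally, so a useful baseline is to log what the image looks like before it goes in and to save a letterboxed copy for visual comparison. Here is a minimal, generic sketch (not Ultralytics' internal preprocessing) using Pillow and NumPy; the file names are placeholders:

import numpy as np
from PIL import Image

def log_and_letterbox(path, target=640, pad_value=114):
    img = Image.open(path).convert("RGB")
    arr = np.asarray(img)
    print(f"original size: {img.size}, dtype: {arr.dtype}, pixel range: [{arr.min()}, {arr.max()}]")

    # Resize with preserved aspect ratio, then pad to a square target x target canvas
    scale = target / max(img.size)
    new_w, new_h = round(img.width * scale), round(img.height * scale)
    resized = img.resize((new_w, new_h))

    canvas = Image.new("RGB", (target, target), (pad_value,) * 3)
    canvas.paste(resized, ((target - new_w) // 2, (target - new_h) // 2))
    canvas.save("letterboxed_python.png")
    print(f"letterboxed to {target}x{target}, image content {new_w}x{new_h}")

log_and_letterbox("test_image.jpg")

Doing the equivalent logging on the iOS side (image size, orientation, and the resized buffer handed to CoreML) makes any preprocessing mismatch easy to spot.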

3. CoreML Conversion Quirks

While the user confirmed that the .pt and .mlpackage models produced identical results in Python, there might still be subtle issues introduced during the CoreML conversion process that only manifest on iOS. CoreML has its own set of operators and optimizations, and certain operations might be handled differently than in the original PyTorch model.

Here are some ways to investigate this:

  • CoreMLTools version: Ensure that you are using a compatible version of coremltools for the Ultralytics YOLO version you are using. Incompatibilities can sometimes lead to unexpected behavior.
  • Inspect the CoreML graph: Use tools like Netron to visualize the CoreML graph and compare it to the original PyTorch graph, looking for unexpected changes or simplifications; a quick coremltools-based inspection is sketched after this list.
  • Simplify the model: Try exporting a simpler version of the YOLO model (e.g., a smaller variant or with fewer layers) to see if the issue persists. This can help narrow down the source of the problem.
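For a quick look at what was actually exported, coremltools can load the package and print its type and input/output descriptions, which tells you the expected input size and whether the model is a plain network or a pipeline with extra stages. A minimal sketch, assuming coremltools is installed and using the package path from the export step:

import coremltools as ct

mlmodel = ct.models.MLModel("detect/train8/weights/best.mlpackage")
spec = mlmodel.get_spec()

# Model type: e.g. mlProgram, neuralNetwork, or pipeline
print("model type:", spec.WhichOneof("Type"))

# Expected inputs and produced outputs
for inp in spec.description.input:
    print("input: ", inp.name, inp.type.WhichOneof("Type"))
for out in spec.description.output:
    print("output:", out.name, out.type.WhichOneof("Type"))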

4. Numerical Precision

Deep learning models often involve a large number of floating-point operations. Subtle differences in numerical precision between different hardware and software platforms can sometimes accumulate and lead to noticeable differences in results. This is especially true for complex models like YOLO.

To explore this, you could:

  • CoreML quantization: CoreML supports quantization, which reduces the precision of the model's weights and activations. While this can improve performance and reduce model size, it can also affect accuracy. Experiment with different quantization settings to see if it changes the discrepancy; a sketch follows this list.
  • Floating-point settings: Check if there are any platform-specific floating-point settings that might be influencing the results. This is less likely to be the primary cause, but it's worth considering.
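On the export side, one easy experiment is to control the precision of the exported weights. A minimal sketch using the Ultralytics export API from earlier; half=True requests a reduced-precision (FP16) export, and the resulting .mlpackage should be renamed between runs since both exports write to the same location:

from ultralytics import YOLO

model = YOLO("detect/train8/weights/best.pt")

# Default-precision export; rename the resulting .mlpackage before the next call
model.export(format="coreml", imgsz=640, nms=False)

# FP16 export for comparison against the default on the device
model.export(format="coreml", imgsz=640, nms=False, half=True)

On the device side, the complementary experiment is to load the model with MLModelConfiguration's computeUnits set to .cpuOnly, which takes the Neural Engine (and its reduced-precision arithmetic) out of the picture.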

5. iOS-Specific Issues

Finally, there might be issues specific to the iOS environment that are contributing to the problem.

  • Memory constraints: iOS devices have limited memory compared to desktop computers. If the model is too large or the inference process is memory-intensive, it could lead to performance degradation or even incorrect results. Try monitoring memory usage on the iOS device during inference.
  • Threading and concurrency: The yolo-ios-app might be using different threading or concurrency mechanisms than the Python code. If there are any race conditions or synchronization issues, it could lead to non-deterministic results.

Wrapping Up: A Call to Action

This discrepancy between Python and iOS results is a classic example of the challenges involved in deploying machine learning models in the real world. While the initial training and validation steps might seem straightforward, ensuring consistent performance across different platforms requires careful attention to detail.

The user's detailed report and systematic approach to debugging are commendable. By breaking down the problem into smaller steps and meticulously comparing the results at each stage, they've set the stage for a successful resolution.

If you've encountered a similar issue or have any insights into the potential causes, please share your thoughts in the comments below! Let's work together to unravel this mystery and ensure that our YOLO models perform as expected, no matter where they're deployed. Remember, the devil is often in the details, and a thorough investigation is the key to success in the world of machine learning! Happy debugging, folks! 🚀