Poracle: Testing Patches Under Preservation Conditions to Combat the Overfitting Problem of Program Repair

Elkhan Ismayilzada, Md Mazba Ur Rahman, Dongsun Kim, and Jooyong Yi

UNIST, South Korea, Kyungpook National University, South Korea

Test-based APR

Overfitting Problem of APR

APR Design Space

Existing Approach: Score-based Patch Classification

  • How to compute the score
    • Dynamic analysis (e.g. PATCH-SIM)
    • Machine learning (e.g. ODS)

Existing Approach: Score-based Patch Classification

  • Threshold problem
    • If too aggressive, it may miss correct patches
    • If too conservative, it may accept many incorrect patches
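
The threshold trade-off above can be illustrated with a minimal sketch. The class `ThresholdSketch`, the patch IDs, and the scores are hypothetical and are not taken from PATCH-SIM or ODS. The sketch assumes that a higher score means a patch is more likely to be incorrect.

```java
import java.util.Arrays;
import java.util.List;

/** Toy score-based patch classifier; all names and numbers are illustrative assumptions. */
final class ThresholdSketch {

    /** A plausible patch with an assumed "incorrectness" score and its ground-truth label. */
    static final class ScoredPatch {
        final String id;
        final double score;     // assumption: higher means more likely incorrect
        final boolean correct;  // ground truth, unknown in practice

        ScoredPatch(String id, double score, boolean correct) {
            this.id = id;
            this.score = score;
            this.correct = correct;
        }
    }

    /** A patch is rejected when its score reaches the threshold. */
    static boolean rejects(ScoredPatch patch, double threshold) {
        return patch.score >= threshold;
    }

    /** Number of correct patches that are rejected (the cost of an aggressive threshold). */
    static long correctRejected(List<ScoredPatch> patches, double threshold) {
        return patches.stream().filter(p -> p.correct && rejects(p, threshold)).count();
    }

    /** Number of incorrect patches that are accepted (the cost of a conservative threshold). */
    static long incorrectAccepted(List<ScoredPatch> patches, double threshold) {
        return patches.stream().filter(p -> !p.correct && !rejects(p, threshold)).count();
    }

    /** Made-up sample in which the scores of correct and incorrect patches overlap. */
    static List<ScoredPatch> sample() {
        return Arrays.asList(
            new ScoredPatch("p1", 0.9, false),
            new ScoredPatch("p2", 0.4, false),
            new ScoredPatch("p3", 0.5, true),
            new ScoredPatch("p4", 0.2, true));
    }
}
```

With this sample, a threshold of 0.3 rejects the correct patch `p3`, while a threshold of 0.8 accepts the incorrect patch `p2`. Because the two score ranges overlap, no threshold avoids both kinds of error.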

Existing Approach: Evidence-based Classification

  • New inputs can be generated using a fuzzer.
    • e.g., OPAD, Fix2Fit

Existing Approach: Evidence-based Classification

  • Oracle problem
    • How to determine unexpected behaviors other than crashes?

Things to Think About

  • The test suite was not originally prepared to conduct the patch validation of program repair.
  • What if the developers are willing to adjust the test suite to conduct patch validation better?

Preservation Condition

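$\forall \vec{v}: \phi(\vec{v}) \Rightarrow T_{P}(\vec{v}) = T_{P'}(\vec{v})$

  • $\phi$: preservation condition, typically simple
  • $T_{P}(\vec{v})$: output of test $T$ when $\vec{v}$ is given to the original program $P$
  • $T_{P'}(\vec{v})$: output of test $T$ when $\vec{v}$ is given to the patched program $P'$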
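
Read as a check on a single input, a preservation condition says: whenever the condition holds, the test output must be the same before and after patching. The sketch below illustrates that reading under the assumption that the two programs being compared are the original and the patched version. `PreservationCheck`, `preserved`, and the toy programs are hypothetical names and are not part of Poracle.

```java
import java.util.Objects;
import java.util.function.Function;
import java.util.function.Predicate;

/** Single-input preservation check; an illustrative sketch, not Poracle's implementation. */
final class PreservationCheck {

    /**
     * Returns true when the input does not violate preservation: either the
     * condition does not hold, or both versions produce equal outputs.
     */
    static <I, O> boolean preserved(Predicate<I> condition,
                                    Function<I, O> original,
                                    Function<I, O> patched,
                                    I input) {
        if (!condition.test(input)) {
            return true; // outside the condition, behavior is allowed to change
        }
        return Objects.equals(original.apply(input), patched.apply(input));
    }

    // Toy scenario: the "bug" is that Math.abs(Integer.MIN_VALUE) is negative.
    static final Function<Integer, Integer> ORIGINAL = x -> Math.abs(x);

    // A patch that fixes the corner case and keeps all other behavior.
    static final Function<Integer, Integer> GOOD_PATCH =
        x -> x == Integer.MIN_VALUE ? Integer.MAX_VALUE : Math.abs(x);

    // An overfitting patch: it fixes the corner case but breaks negative inputs.
    static final Function<Integer, Integer> OVERFITTING_PATCH =
        x -> x == Integer.MIN_VALUE ? Integer.MAX_VALUE : x;

    // Behavior should be preserved for every input except the corner case.
    static final Predicate<Integer> CONDITION = x -> x != Integer.MIN_VALUE;
}
```

For the input `-5`, the good patch preserves the original output, whereas the overfitting patch does not. For `Integer.MIN_VALUE` the condition is false, so neither patch is penalized for changing the output.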

Preservation Condition Example

  • Math99 bug: A corner case is not handled properly.
public void testGcd() {
  try {
    MathUtils.gcd(Integer.MIN_VALUE, 0);
    fail("expecting ArithmeticException");
  } catch (ArithmeticException expected) { /* expected */ }
}

↓

public void testGcd(int i, int j) {
  // True for every input except the corner case exercised by the original test.
  // Declared before the try block so that the catch block can also use it.
  boolean complement = !( (i==Integer.MIN_VALUE && j==0) || (i==0 && j==Integer.MIN_VALUE) );
  try {
    final long actual = MathUtils.gcd(i, j);
    preserveIf(complement, () -> new Long[] { actual });
  } catch (ArithmeticException e) {
    preserveIf(!complement, () -> new String[] { e.toString() });
  } catch (Exception e) {
    failToPreserve();
  }
}

Differential Fuzzing

  • We have developed a differential fuzzer on top of JQF.
  • Our fuzzer supports preserveIf and other APIs for preservation conditions.
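
The idea of differential fuzzing under a preservation condition can be sketched without JQF. The loop below draws plain random inputs, which is a simplification: it does not use JQF's coverage-guided input generation, and `DifferentialFuzzSketch`, `findViolation`, and the toy programs are hypothetical names rather than Poracle's API. Exceptions are recorded as part of the outcome, so a change in exceptional behavior also counts as a difference.

```java
import java.util.OptionalInt;
import java.util.Random;
import java.util.function.IntPredicate;
import java.util.function.IntUnaryOperator;

/** Random differential testing of two program versions; an illustrative sketch only. */
final class DifferentialFuzzSketch {

    /** Runs a program and renders its result, or the exception class name, as a string. */
    static String outcome(IntUnaryOperator program, int input) {
        try {
            return String.valueOf(program.applyAsInt(input));
        } catch (RuntimeException e) {
            return e.getClass().getName();
        }
    }

    /**
     * Searches for an input that satisfies the preservation condition but makes
     * the two versions behave differently. Returns empty if none is found.
     */
    static OptionalInt findViolation(IntUnaryOperator original,
                                     IntUnaryOperator patched,
                                     IntPredicate condition,
                                     long seed,
                                     int trials) {
        Random random = new Random(seed);
        for (int t = 0; t < trials; t++) {
            int input = random.nextInt();
            if (condition.test(input)
                    && !outcome(original, input).equals(outcome(patched, input))) {
                return OptionalInt.of(input); // evidence that the patch is incorrect
            }
        }
        return OptionalInt.empty();
    }

    // Toy original program: wrong only for Integer.MIN_VALUE.
    static int original(int x) {
        return Math.abs(x);
    }

    // Toy patch that changes the corner case only.
    static int goodPatch(int x) {
        return x == Integer.MIN_VALUE ? Integer.MAX_VALUE : Math.abs(x);
    }

    // Toy overfitting patch that also changes the result for negative inputs.
    static int overfittingPatch(int x) {
        return x == Integer.MIN_VALUE ? Integer.MAX_VALUE : x;
    }
}
```

A patch is rejected only when a concrete violating input is found, so a failure to find one within the trial budget is not proof that the patch is correct.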

Evaluation

  • Classification performance as compared to the state-of-the-art approaches
  • Usability of the preservation condition

Classification Performance

  • Dataset: 458 plausible patches generated by 15 APR tools for 77 bugs in Defects4J
  • Compared approaches:
    • Score-based: PATCH-SIM, BERT-LR, ODS
    • Evidence-based: OPAD
  • Metrics:
    • Precision: $\frac{TP}{TP + FP}$
    • Recall: $\frac{TP}{TP + FN}$
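
As a worked example of the two metrics, the sketch below computes precision and recall from raw counts. It assumes, as is common in patch classification studies but not stated on this slide, that a "positive" is a patch classified as incorrect. The class name and the counts in the usage note are made up for illustration.

```java
/** Precision and recall from confusion-matrix counts; an illustrative sketch. */
final class PatchMetricsSketch {

    /**
     * Fraction of rejected patches that are truly incorrect.
     * Returns NaN when no patch is rejected.
     */
    static double precision(int truePositives, int falsePositives) {
        return (double) truePositives / (truePositives + falsePositives);
    }

    /**
     * Fraction of truly incorrect patches that are rejected.
     * Returns NaN when there is no incorrect patch.
     */
    static double recall(int truePositives, int falseNegatives) {
        return (double) truePositives / (truePositives + falseNegatives);
    }
}
```

For instance, if a classifier rejects 10 patches of which 8 are truly incorrect, and misses 8 other incorrect patches, then `precision(8, 2)` is 0.8 and `recall(8, 8)` is 0.5.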

Comparison with ODS

  • ODS is too aggressive: rejecting many correct patches
  • The last thing we want: rejecting a correct patch

User Study

  • Participants: 66 undergraduate students taking the Software Engineering course at UNIST in 2022.
  • Task: Find a correct patch out of 10 given plausible patches.

Manual vs Semi-Automated

| Question | Bug ID | Pattern |
|:---:|:---:|:---|
| Q1 | Math73 | CC (Complementary Case) |
| Q2 | Math105 | EGA (Existing General Assertion) |
| Q3 | Math28 | UE (Unexpected Exception) |
| Q4 | Lang58 | RI (Reference Implementation) |

  • Manual Group: Find a correct patch without using Poracle.

  • Semi-Automated Group: Find a correct patch using Poracle.

Correct Answer Ratio

Conclusion

  • We have suggested adding a semi-automated patch validation step to the existing APR pipeline.
  • Our experimental results and user study results show that Poracle can be a promising approach to reducing the overfitting problem of APR.

# Food for Thought

- Ideally, incorrect patches are supposed to be identified by the test suite.

![width:1000px](./img/APR-pipeline.jpg)

---

# Things to Think About

![bg left:33% fit](./img/overfitting.jpg)

- The large gap between the plausible patch space and the correct patch space suggests that the quality of the test suite is not good enough for patch validation.

---

# Difficulty of Test Generalization

$\forall \vec{v}: T(\vec{v}) = \psi(\vec{v})$

- $\vec{v}$: inputs
- $T(\vec{v})$: output of test $T$ when $\vec{v}$ is given
- $\psi$: the oracle function

---

# Preservation Condition Example

- Math105 bug: the existing assertion is reused as the preservation condition.

```java
public void testSSENonNegative(double d1, ..., double d6) {
  try {
    double[] y = {d1, d2, d3};
    double[] x = {d4, d5, d6};
    SimpleRegression reg = new SimpleRegression();
    for (int i = 0; i < x.length; i++) {
      reg.addData(x[i], y[i]);
    }
    double ret = reg.getSumSquaredErrors();
    // Original: assertTrue(ret >= 0.0);
    preserveIf(ret >= 0.0, () -> new Double[] { ret });
  } catch (Exception e) {
    failToPreserve();
  }
}
```

---

# Preservation Condition Example

- Lang58 bug: the preservation condition exploits a reference implementation.

```java
public void testLang300(int n, int m) {
  // NumberUtils.createNumber("1l"); // Original body
  // Test with a generalized input
  String s = "" + ((char) n) + ((char) m) + "l";
  String actOut = "";
  try {
    actOut = "" + NumberUtils.createNumber(s).longValue();
  } catch (Exception e) {
    actOut = "Exception";
  }
  // Use Long.valueOf as a reference
  String refOut = "";
  try {
    refOut = "" + Long.valueOf(s);
  } catch (Exception e) {
    refOut = "Exception";
  }
  // Copy into a final local: a lambda can only capture effectively final variables.
  final String out = actOut;
  preserveIf(actOut.equals(refOut), () -> new String[] { out });
}
```

# Patch Validation with Preservation Condition

![width:1500px](./img/workflow.jpeg)

---

- Patch reviewing cost reduction
  - The number of patches to be reviewed after filtering

# Patch Reviewing Cost Reduction

![bg left:33% fit](./img/APR-design-find-and-filter.jpg)

- JAID returns a ranked list of plausible patches.
- We applied Poracle to the obtained ranked list of plausible patches and compared the number of patches to be reviewed before and after filtering.

---

# Patch Reviewing Cost Reduction

![width:1200px](./img/cmp-cost.jpg)

---
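
The before/after comparison can be pictured as a rank-preserving filter over the list of plausible patches. The sketch below is a minimal illustration under that assumption. `ReviewCostSketch`, the patch IDs, and the set of rejected patches are hypothetical and do not come from JAID or Poracle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Filters a ranked patch list while keeping the original ranking order; a sketch only. */
final class ReviewCostSketch {

    /** Returns the patches that remain to be reviewed after dropping rejected ones. */
    static List<String> filter(List<String> rankedPatches, Set<String> rejected) {
        List<String> kept = new ArrayList<>();
        for (String id : rankedPatches) {
            if (!rejected.contains(id)) {
                kept.add(id); // relative order of the surviving patches is unchanged
            }
        }
        return kept;
    }
}
```

If the ranked list is `[p1, p2, p3, p4]` and `p1` and `p3` are rejected, the reviewer is left with `[p2, p4]`, so the number of patches to review drops from 4 to 2.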

- For each question, participants were divided into two groups.

---

# Ablation Study

![width:700px](./img/CoincidentallyRejected.jpg)

---

# Example

- Failing test for Math95 of Defects4J

```java
public void testSmallDegreesOfFreedom() {
  FDistributionImpl fd = new FDistributionImpl(1.0, 1.0);
  double p = fd.cumulativeProbability(0.975);
  double x = fd.inverseCumulativeProbability(p);
  assertEquals(/* expected output */ 0.975, x, /* delta */ 1e-5);
}
```

---

# Example

- Generalizing the failing test

```java
public void testSmallDegreesOfFreedom() {
  FDistributionImpl fd = new FDistributionImpl(1.0, 1.0);
  double p = fd.cumulativeProbability(0.975);
  double x = fd.inverseCumulativeProbability(p);
  assertEquals(/* expected output */ 0.975, x, /* delta */ 1e-5);
}
```

↓

```java
public void testSmallDegreesOfFreedom(double d1, double d2, double d3) {
  FDistributionImpl fd = new FDistributionImpl(d1, d2);
  double p = fd.cumulativeProbability(d3);
  double x = fd.inverseCumulativeProbability(p);
  assertEquals(/* expected output */ ________, x, /* delta */ 1e-5);
}
```

---

# Preservation Condition Example

- Math95 bug: An unexpected exception occurs.

```java
public void testSmallDegreesOfFreedom(double d1, double d2, double d3) {
  try {
    FDistributionImpl fd = new FDistributionImpl(d1, d2);
    double p = fd.cumulativeProbability(d3);
    double x = fd.inverseCumulativeProbability(p);
    preserveIf(/* preservation condition */ true,
               /* outputs to compare */ () -> new Double[] {x});
  } catch (Exception e) {
    failToPreserve();
  }
}
```

![width:900px](./img/preserveIf.png)

---

# Classification Performance

![width:700px](./img/cmp-bert-lr.jpg)

- BERT-LR: Haoye Tian et al., "Evaluating representation learning of code changes for predicting patch correctness in program repair", ASE 2020

---

# Developer Patches

![width:700px](./img/developer-patches.jpg)

---

# Correct Answer Ratio

| Top 50% Students | Bottom 50% Students |
|:---:|:---:|
| ![width:600px](./img/high_exp_scores.jpg) | ![width:600px](./img/low_exp_scores.jpg) |

---

# Manual Time Cost

| All students | Students who submitted correct answers |
|:---:|:---:|
| ![width:600px](./img/overall_time.jpg) | ![width:600px](./img/overall_time_only_correct.jpg) |

---

# Sentiment

| All students | Top 50% students |
|:---:|:---:|
| ![width:600px](./img/poracle_manual_experience_bar.jpg) | ![width:600px](./img/poracle_manual_experience_bar_grade.jpg) |