CORRECT is a code reviewer recommendation tool that:
- Recommends appropriate code reviewers automatically by mining developers' contributions across projects
- Provides recommendation rationales that fit within developers' workflows
- Achieves over 90% accuracy in recommending reviewers based on library and technology experience
- Outperforms an existing technique (RevFinder), achieving 92.15% top-5 accuracy, 85.93% mean precision, and 81.39% mean recall
- Performs similarly on open source projects with 85.20% top-5 accuracy, demonstrating effectiveness for public and private codebases
1. CORRECT: CODE REVIEWER RECOMMENDATION AT GITHUB FOR VENDASTA TECHNOLOGIES
Mohammad Masudur Rahman, Chanchal K. Roy, Jesse Redl$ and Jason A. Collins*
Department of Computer Science, University of Saskatchewan, Canada
Vendasta Technologies$, Canada; Google Inc.*, USA
31st IEEE/ACM International Conference on Automated Software Engineering (ASE 2016), Singapore
2. PEER CODE REVIEW
Code review is a systematic examination of source code for detecting bugs or defects and coding rule violations.
Early bug detection
Prevention of coding rule violations
Enhanced developer skills
3. PULL REQUEST (CODE CHANGES) SUBMISSION AT GITHUB
[Screenshot: GitHub pull request UI, with callouts for the change title, the change description, and the member mention feature]
Whom should I choose?
Well, where there is a will, there is a way!
6. WHAT DO WE NEED?
Recommendation Tool
Recommends appropriate code reviewers
Recommends automatically
Does all heavy lifting (i.e., mining) for the developers.
Provides recommendation rationale
Fits within developer's workflow
Advanced Features
Provides personalized recommendation
Provides optimized performance
Architecture
Platform-independent & scalable
14. LIBRARY EXPERIENCE & TECHNOLOGY EXPERIENCE (ANSWERED: RQ1)
Metric     Library Similarity    Technology Similarity    Combined Similarity
           Top-3      Top-5      Top-3      Top-5          Top-3      Top-5
Accuracy   83.57%     92.02%     82.18%     91.83%         83.75%     92.15%
MRR        0.66       0.67       0.62       0.64           0.65       0.67
MP         65.93%     85.28%     62.99%     83.93%         65.98%     85.93%
MR         58.34%     80.77%     55.77%     79.50%         58.43%     81.39%
[ MP = Mean Precision, MR = Mean Recall, MRR = Mean Reciprocal Rank ]
Both library experience and technology experience are found to be good proxies, each providing over 90% accuracy.
Combined experience provides the maximum performance: 92.15% recommendation accuracy with 85.93% precision and 81.39% recall.
Evaluation results align with exploratory study findings.
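To make these metrics concrete, here is a minimal Python sketch of how Top-K accuracy, MRR, mean precision, and mean recall can be computed over ranked recommendations; the function names and example data are hypothetical, not taken from CORRECT's implementation.

    # Illustrative metric computation for reviewer recommendation.
    # All names and data are hypothetical, not from the CORRECT tool.
    def evaluate(recommendations, ground_truth, k=5):
        # recommendations: ranked reviewer lists, one per pull request
        # ground_truth: sets of actual reviewers, one per pull request
        hits, rr_sum, prec_sum, rec_sum = 0, 0.0, 0.0, 0.0
        for ranked, actual in zip(recommendations, ground_truth):
            top_k = ranked[:k]
            matched = [r for r in top_k if r in actual]
            if matched:
                hits += 1                          # Top-K accuracy: at least one correct reviewer
                first = min(top_k.index(r) for r in matched)
                rr_sum += 1.0 / (first + 1)        # reciprocal rank of the first correct reviewer
            prec_sum += len(matched) / len(top_k)  # precision@K
            rec_sum += len(matched) / len(actual)  # recall@K
        n = len(recommendations)
        return hits / n, rr_sum / n, prec_sum / n, rec_sum / n

    # Example with two hypothetical pull requests:
    recs = [["alice", "bob", "carol", "dan", "eve"],
            ["frank", "grace", "heidi", "ivan", "judy"]]
    truth = [{"bob", "dan"}, {"mallory"}]
    print(evaluate(recs, truth))  # (0.5, 0.25, 0.2, 0.5)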
15. COMPARATIVE STUDY FINDINGS (ANSWERED: RQ2)
CORRECT performs better than the competing technique in all metrics (p-value = 0.003 < 0.05 for Top-5 accuracy).
It performs better both on average and on individual projects.
RevFinder computes PR similarity using source file name and directory path matching.
Metric     RevFinder [18]   CORRECT
           Top-5            Top-5
Accuracy   80.72%           92.15%
MRR        0.65             0.67
MP         77.24%           85.93%
MR         73.27%           81.39%
[ MP = Mean Precision, MR = Mean Recall, MRR = Mean Reciprocal Rank ]
16. COMPARISON ON OPEN SOURCE PROJECTS (ANSWERED: RQ3)
In OSS projects, CORRECT also performs better than the baseline technique.
It achieves 85.20% accuracy with 84.76% precision and 78.73% recall, not significantly different from the earlier results (p-value = 0.239 > 0.05 for precision).
Results for the private and public codebases are quite close.
Metric     RevFinder [18]   CORRECT (OSS)   CORRECT (VA)
           Top-5            Top-5           Top-5
Accuracy   62.90%           85.20%          92.15%
MRR        0.55             0.69            0.67
MP         62.57%           84.76%          85.93%
MR         58.63%           78.73%          81.39%
[ MP = Mean Precision, MR = Mean Recall, MRR = Mean Reciprocal Rank ]
17. SUMMARY
CORRECT: A Recommendation Tool
Recommends appropriate code reviewers
Recommends automatically
Does all heavy lifting (i.e., mining) for the developers.
Provides recommendation rationale
Fits within developer's workflow
Advanced Features
Provides personalized recommendation
Provides optimized performance
Architecture
Platform-independent & scalable
19. THANK YOU!! QUESTIONS?
Masud Rahman (masud.rahman@usask.ca)
CORRECT site (http://www.usask.ca/~masud.rahman/correct)
Acknowledgement: This work is supported by NSERC
20. THREATS TO VALIDITY
Threats to Internal Validity
Skewed dataset: Each of the 10 selected projects is medium-sized (i.e., about 1.1K PRs) except CS.
Threats to External Validity
Limited OSS dataset: Only 6 OSS projects were considered, which is not sufficient for generalization.
Issue of heavy PRs: PRs containing hundreds of files can
make the recommendation slower.
Threats to Construct Validity
Top-K Accuracy: Does the metric represent the effectiveness of the technique? It is widely used in the relevant literature (Thongtanunam et al., SANER 2015).
Editor's Notes
Hello everyone.
My name is Mohammad Masudur Rahman
I am a PhD student from University of Saskatchewan, Canada.
Today, I am going to talk about code reviewer recommendation for Vendasta Technologies.
I work with Dr. Chanchal Roy. The other co-authors of the paper are Jesse Redl from Vendasta, Canada, and Jason Collins from Google, USA.
The focus of my talk is code review.
It is a systematic examination of source code
that identifies defects and coding standard violations in the code.
It helps in early bug detection, and thus reduces cost.
It also ensures code quality by maintaining the coding standards.
And finally, it helps in knowledge dissemination among the developers.
However, in this work, we attempt to identify appropriate code reviewers for a given pull request.
And this is a significant challenge for the developers, as we found from working with the industry.
In GitHub, code changes are submitted as a pull request (PR).
A developer needs to create a pull request to submit the changes, and has to choose appropriate code reviewers there.
Now, this is the UI GitHub provides for submitting a pull request.
Here goes the title, here goes the description. It even allows you to mention a peer.
But the question is, whom should I choose as a code reviewer?
Well, where there is a will, there is an ad-hoc way.
One can directly go to the file system to check for the previous authors who changed a file.
But, here is the reality.
The first file was changed by 9 developers.
The second one was changed by 6 developers.
Now, one can look at those developers, and try to guess their appropriateness.
But, this is NOT a productive idea.
This gets out of hand, and becomes nearly impossible, when multiple changed files and multiple commits are involved.
Code reviewer selection is even more challenging for
novice developers, who are not aware of the skill matrix of other developers, and for
distributed development, where the developers do not meet face to face, let alone know each other's skill sets.
A study also showed that inappropriate assignment of reviewers costs an extra 12 days for bug fixing, on average.
Now, why is it so challenging? Because
this skill is not well defined, and cannot be easily estimated.
Estimating it requires significant mining activities.
So, what do we need to handle this challenge?
We need a recommendation tool
that can recommend appropriate code reviewers automatically
It will do all the heavy lifting for the developers, that is, all the required mining.
It should provide a rationale why a developer is chosen.
It should fit within the existing workflow.
It should provide personalization and optimization features, such as result caching.
The architecture also has to be seamless and scalable.
So, we propose our tool called CORRECT.
It suggests appropriate reviewers based on external libraries included
and specialized technologies used in a pull request submitted for code review.
Now, let's walk through our tool.
Once our tool is installed, it will show as an icon in Google Chrome.
Now, in Vendasta, developers generally create a branch, for example AA-2453, to work on an issue such as a bug fix or a feature request.
Once the work is done, they compare the branch with the develop/master branch
For example this URL is a compare URL, and it shows 1 commit is added where 6 files are changed.
Now, if requested, the tool suggests a ranked list of 5 code reviewers.
It also shows the rationale why a particular developer was suggested as code reviewer.
Once convinced, one can copy them using the copy button and paste them into the pull request body.
This mention will notify the corresponding developers.
Then one can submit the pull request for the review.
Now, let's check our recommendation accuracy against an existing PR.
For example, for PR# 1745 of SR system, our tool suggests 5 code reviewers
And the first two reviewers matched the original reviewers for this PR.
Again, it shows the rationale why the particular developer was suggested.
One can also clear the result, and try another PR.
Our tool also provides several advanced features.
1. Open authentication: The tool can make API requests on behalf of the requesting user. This solves the API invocation limit issue.
For example, 5,000 calls/hour for a developer. This is especially necessary for a company where several people are using the tool at the same time.
It also facilitates recommendation customization. Currently, we provide limited customization.
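As a hedged illustration of the rate-limit point, the sketch below makes an authenticated call to GitHub's real /rate_limit endpoint; the token value is a placeholder, and this is not code from the tool itself.

    # Sketch: an authenticated GitHub API call (token is a placeholder).
    # Authenticated requests get 5,000 calls/hour instead of 60.
    import requests

    token = "ghp_..."  # hypothetical personal access token
    resp = requests.get("https://api.github.com/rate_limit",
                        headers={"Authorization": "token " + token})
    print(resp.json()["resources"]["core"])  # e.g., {'limit': 5000, 'remaining': ...}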
2. Parallel/Optimized processing: We use Java multi-threading to optimize computation and memory consumption.
We also use browser storage and server storage to provide caching facilities.
3. Client-server architecture: We also adopt a scalable and platform-independent architecture. Not only Google Chrome, but any client capable of HTTP call will be able to get the recommendation service.
This is our recommendation algorithm.
Once a new pull request R3 is created, we analyze its commits, then source files, and look for the libraries referred and the specialized technologies used. Thus, we get a library token list and a technology token list.
We combine both lists, and the combined list can be considered a summary of the libraries and technologies for the new pull request.
Now, we consider the latest 10 closed pull requests, and collect their library and technology tokens.
It should be noted that the past requests contain their code reviewers.
Now, we estimate the similarity between the new request and each of the past requests. We use the cosine similarity score between their token lists.
We add that score to the corresponding code reviewers.
This way, finally, we get a list of reviewers, each with a score accumulated over different past reviews.
Then they are ranked, and the top reviewers are recommended.
Thus, we use the pull request similarity score to estimate the relevant expertise of candidate code reviewers.
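The following is a minimal Python sketch of this scoring loop under simplifying assumptions: token lists are modeled as frequency vectors, cosine similarity is computed directly, and the past-PR data structure is a hypothetical stand-in for what the tool actually mines.

    # Illustrative CORRECT-style reviewer scoring (not the tool's actual code).
    from collections import Counter
    from math import sqrt

    def cosine(a, b):
        # cosine similarity between two token-frequency Counters
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def recommend(new_tokens, past_prs, k=5):
        # new_tokens: library + technology tokens of the new pull request
        # past_prs: (tokens, reviewers) pairs for recently closed pull requests
        new_vec = Counter(new_tokens)
        scores = Counter()
        for tokens, reviewers in past_prs:
            sim = cosine(new_vec, Counter(tokens))
            for reviewer in reviewers:  # credit every reviewer of the past PR
                scores[reviewer] += sim
        return [r for r, _ in scores.most_common(k)]

    # Hypothetical example: one new PR compared against two closed PRs
    past = [(["numpy", "flask", "redis"], ["alice", "bob"]),
            (["django", "redis"], ["carol"])]
    print(recommend(["flask", "redis", "celery"], past))  # ['alice', 'bob', 'carol']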
Earlier studies analyzed the line change history of source code, the file path similarity of source files, and review comments.
In short, they mostly considered the work experience of a candidate code reviewer within a single project only.
However, some skills span across multiple projects, such as working experience with specific API libraries or specialized technologies.
Also, in an industrial setting, a developer's contributions scatter across different projects within the company codebase.
We thus consider external libraries and APIs included in the changed code and suggest more appropriate code reviewers.
Now, to be technically specific
The state-of-the-art considers two pull requests relevant/similar if they share source code files or directories.
On the other hand, we suggest that two pull requests are relevant/similar if they share the same external libraries and specialized technologies.
That’s the major difference in methodology and our core technical contribution.
This is how we answer the first RQ.
We see that both library similarity and technology similarity are pretty good proxies for code review skills.
Each of them provides over 90% top-5 accuracy.
However, when we combine them, we get the maximum: 92% top-5 accuracy.
The precision and recall are also greater than 80%, which is highly promising according to the relevant literature.
We then compare with the state-of-the-art, RevFinder.
We found that our performance is significantly better than theirs. We get a p-value of 0.003 for top-5 accuracy with the Mann-Whitney U test.
The median accuracy is 95%. The median precision and median recall are between 85% and 90%.
In the case of individual projects, our technique also outperformed the state-of-the-art.
We also experimented with 6 open source projects, and found 85% Top-5 accuracy.
For precision and recall, the results are not significantly different from those on the Vendasta projects.
For example, with precision, we get a p-value of 0.239 which is greater than 0.05.
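For readers unfamiliar with the test, here is a minimal SciPy sketch of a Mann-Whitney U comparison; the per-project accuracy samples below are made up, since the deck reports only the p-values.

    # Illustrative Mann-Whitney U test (accuracy samples are made up).
    from scipy.stats import mannwhitneyu

    correct_acc = [0.95, 0.93, 0.91, 0.94, 0.90]    # hypothetical per-project Top-5 accuracies
    revfinder_acc = [0.82, 0.79, 0.81, 0.80, 0.78]  # hypothetical per-project Top-5 accuracies

    stat, p = mannwhitneyu(correct_acc, revfinder_acc, alternative="two-sided")
    print(stat, p)  # p < 0.05 indicates a statistically significant difference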
To summarize, we propose a code reviewer recommendation tool
Just read them out…
The hands-on session is tomorrow.
You are cordially invited to the hands-on session.
That’s all I have to say today.
Thanks for your time. Now, I am ready to take questions.
There are a few threats to the validity of our findings.
-- The dataset from the VA codebase is a bit skewed. Most of the projects are medium-sized and only one project is big.
-- Also, the number of projects considered from the open source domain is limited.
-- Also, the technique could be slower for big pull requests.