1. CORRECT: CODE REVIEWER
RECOMMENDATION IN GITHUB BASED ON
CROSS-PROJECT & TECHNOLOGY
EXPERIENCE
Mohammad Masudur Rahman, Chanchal K. Roy and
Jason A. Collins*
Department of Computer Science
University of Saskatchewan, Canada, Google Inc., USA*
38th International Conference on Software Engineering
(ICSE 2016), Austin, TX, USA
3. CODE REVIEW
Formal inspection
Peer code review
Modern code review
Code review is a systematic
examination of source code for
detecting bugs or defects and
coding rule violations.
Early bug detection
Prevent coding rule violations
Enhance developer skills
5. EXISTING LITERATURE
Line Change History (LCH)
  ReviewBot (Balachandran, ICSE 2013)
File Path Similarity (FPS)
  RevFinder (Thongtanunam et al., SANER 2015)
  FPS (Thongtanunam et al., CHASE 2014)
  Tie (Xia et al., ICSME 2015)
Code Review Content and Comments
  Tie (Xia et al., ICSME 2015)
  SNA (Yu et al., ICSME 2014)
Issues & Limitations
They mine a developer's contributions from
within a single project only.
Library & Technology Similarity
6. OUTLINE OF THE TALK
Vendasta codebase
Exploratory study (3 research questions)
CORRECT
Evaluation using Vendasta codebase
Evaluation using open source projects
Comparative study
Conclusion
7. EXPLORATORY STUDY (3 RQS)
1: How frequently do the commercial software
projects reuse external libraries from within the
codebase?
2: Does the experience of a developer with such
libraries matter in code reviewer selection by other
developers?
3: How frequently do the commercial projects adopt
specialized technologies (e.g., taskqueue,
mapreduce, urlfetch)?
8. DATASET: EXPLORATORY STUDY
Each project has at least 750 closed pull requests.
Each library is used at least 10 times on average.
Each technology is used at least 5 times on average.
10 utility libraries (Vendasta)
10 commercial projects (Vendasta)
10 Google App Engine technologies
9. LIBRARY USES IN COMMERCIAL PROJECTS
(ANSWERED: EXP-RQ1 )
Empirical library usage frequency in 10 projects
Mostly used: vtest, vauth, and vapi
Least used: vlogs, vmonitor
10. LIBRARY USES IN
PULL REQUESTS (ANSWERED: EXP-RQ2)
30%-70% of pull requests used at least one of the 10 libraries
87%-100% of library authors recommended as code reviewers in
the projects using those libraries
Library experience really matters!
[Charts: % of pull requests using the selected libraries; % of library authors recommended as code reviewers]
11. TECHNOLOGY USES
IN PROJECTS (ANSWERED: EXP-RQ3)
Empirical technology usage frequency in top 10
commercial projects
Champion technology: mapreduce
12. TECHNOLOGY USES IN PULL REQUESTS
(ANSWERED: EXP-RQ3)
20%-60% of the pull requests used at least one of the
10 specialized technologies.
Mostly used in: ARM, CS and VBC
13. SUMMARY OF EXPLORATORY FINDINGS
About 50% of the pull requests used one or more of the
selected libraries. (Exp-RQ1)
About 98% of the library authors were later
recommended as pull request reviewers. (Exp-RQ2)
About 35% of the pull requests used one or more
specialized technologies. (Exp-RQ3)
Library experience and Specialized technology
experience really matter in code reviewer
selection/recommendation
14. CORRECT: CODE REVIEWER RECOMMENDATION
IN GITHUB USING CROSS-PROJECT &
TECHNOLOGY EXPERIENCE
17. EVALUATION OF CORRECT
Two evaluations: (1) the Vendasta codebase, and (2) open source software projects
1: Are library experience and technology experience useful
proxies for code review skills?
2: Does CORRECT outperform the baseline technique for
reviewer recommendation?
3: Does CORRECT perform equally/comparably for both
private and public codebases?
4: Does CORRECT show bias to any of the development
frameworks?
18. EXPERIMENTAL DATASET
Sliding window of 30 past requests for learning.
Metrics: Top-K Accuracy, Mean Precision (MP), Mean
Recall (MR), and Mean Reciprocal Rank (MRR).
Vendasta:     10 Python projects, 13,081 pull requests
Open source:  2 Python, 2 Java & 2 Ruby projects, 4,034 pull requests
Gold set:     code reviewers of the closed pull requests
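The sliding-window protocol on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation; `pull_requests` (a chronologically ordered list of records with a `reviewers` field) and `recommend` are hypothetical stand-ins.

```python
# Sketch of the sliding-window evaluation protocol: for each new pull
# request, learn from the 30 most recent closed requests and check
# whether a gold (actual) reviewer appears in the top-k recommendation.

WINDOW = 30  # number of past closed pull requests used for learning

def evaluate(pull_requests, recommend, k=5):
    """Top-k accuracy over a chronologically ordered list of pull requests."""
    hits = total = 0
    for i in range(WINDOW, len(pull_requests)):
        window = pull_requests[i - WINDOW:i]   # the 30 most recent closed PRs
        target = pull_requests[i]
        ranked = recommend(window, target)     # ranked reviewer candidates
        gold = set(target["reviewers"])        # actual reviewers form the gold set
        hits += bool(gold & set(ranked[:k]))   # hit: any gold reviewer in top-k
        total += 1
    return hits / total if total else 0.0
```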
19. LIBRARY EXPERIENCE & TECHNOLOGY
EXPERIENCE (ANSWERED: RQ1)
Metric    Library Similarity   Technology Similarity   Combined Similarity
          Top-3     Top-5      Top-3     Top-5         Top-3     Top-5
Accuracy  83.57%    92.02%     82.18%    91.83%        83.75%    92.15%
MRR       0.66      0.67       0.62      0.64          0.65      0.67
MP        65.93%    85.28%     62.99%    83.93%        65.98%    85.93%
MR        58.34%    80.77%     55.77%    79.50%        58.43%    81.39%
[ MP = Mean Precision, MR = Mean Recall, MRR = Mean Reciprocal Rank ]
Both library experience and technology experience are
found to be good proxies, each providing over 90% Top-5 accuracy.
Combined experience provides the maximum performance:
92.15% recommendation accuracy with 85.93% precision and
81.39% recall.
Evaluation results align with exploratory study findings.
20. COMPARATIVE STUDY FINDINGS (ANSWERED:
RQ2)
CORRECT performs better than the competing technique in all
metrics (p-value = 0.003 < 0.05 for Top-5 accuracy).
It performs better both on average and on individual projects.
RevFinder measures PR similarity using source file name and
directory matching.
Metric    RevFinder [18]   CORRECT
          (Top-5)          (Top-5)
Accuracy  80.72%           92.15%
MRR       0.65             0.67
MP        77.24%           85.93%
MR        73.27%           81.39%
[ MP = Mean Precision, MR = Mean Recall,
MRR = Mean Reciprocal Rank ]
21. COMPARISON ON OPEN SOURCE PROJECTS
(ANSWERED: RQ3)
In OSS projects, CORRECT also performs better than the
baseline technique:
85.20% accuracy with 84.76% precision and 78.73% recall,
not significantly different from the earlier results
(p-value = 0.239 > 0.05 for precision).
Results for private and public codebases are quite close.
Metric    RevFinder [18]   CORRECT (OSS)   CORRECT (VA)
          (Top-5)          (Top-5)         (Top-5)
Accuracy  62.90%           85.20%          92.15%
MRR       0.55             0.69            0.67
MP        62.57%           84.76%          85.93%
MR        58.63%           78.73%          81.39%
[ MP = Mean Precision, MR = Mean Recall, MRR = Mean Reciprocal Rank ]
22. COMPARISON ON DIFFERENT PLATFORMS
(ANSWERED: RQ4)
Metric    Python                     Java                         Ruby
          Beets    St2     Avg.      OkHttp   Orientdb  Avg.      Rubocop  Vagrant  Avg.
Accuracy  93.06%   79.20%  86.13%    88.77%   81.27%    85.02%    89.53%   79.38%   84.46%
MRR       0.82     0.49    0.66      0.61     0.76      0.69      0.76     0.71     0.74
MP        93.06%   77.85%  85.46%    88.69%   81.27%    84.98%    88.49%   79.17%   83.83%
MR        87.36%   74.54%  80.95%    85.33%   76.27%    80.80%    81.49%   67.36%   74.43%
[ MP = Mean Precision, MR = Mean Recall, MRR = Mean Reciprocal Rank ]
In OSS projects, results for the different platforms are
surprisingly close, except for recall.
Accuracy and precision are close to 85% on average.
CORRECT does NOT show bias toward any particular platform.
24. THANK YOU!! QUESTIONS?
Masud Rahman (masud.rahman@usask.ca)
CORRECT site (http://www.usask.ca/~masud.rahman/correct)
Acknowledgement: This work is supported by NSERC
25. THREATS TO VALIDITY
Threats to Internal Validity
Skewed dataset: each of the 10 selected projects is
medium sized (about 1.1K PRs), except CS.
Threats to External Validity
Limited OSS dataset: only 6 OSS projects were considered,
which is not sufficient for generalization.
Issue of heavy PRs: PRs containing hundreds of files can
make the recommendation slower.
Threats to Construct Validity
Top-K Accuracy: does the metric represent the effectiveness
of the technique? It is widely used in the relevant literature
(Thongtanunam et al., SANER 2015)
Editor's Notes
Hello everyone.
My name is Mohammad Masudur Rahman
I am a 2nd year PhD student from University of Saskatchewan, Canada.
Today, I am going to talk about code reviewer recommendation based on cross-project and technology experience.
I work with Dr. Chanchal Roy. The other co-author of the paper is Jason Collins from Google, USA.
When I searched in the web, I found this.
Obviously, code review is not a very good experience all the time
If you do not select appropriate reviewers, the review could be disastrous.
One way to handle the frustration about the code review is choosing the appropriate reviewers for your code.
We already had a talk on code review.
Anyway, just to recap, code review is a systematic examination of source code
that identifies defects and coding standard violations.
It helps in early bug detection—thus reduces cost.
It also ensures code quality by maintaining the coding standards.
Code review has also evolved. First, it was formal inspection which was time-consuming, slow and costly.
Then came a less formal code review—peer code review.
Now we do tool assisted code review—also called modern code review.
This is an example of code review at GitHub. Once a developer submits a pull request, a way of submitting changes at GitHub,
the core developers/reviewers can review the changes and provide their feedback like this.
Our goal in this research is to identify appropriate code reviewers for such a pull request.
Identifying such code reviewers is very important, especially for novice developers who do not know the skill sets of their fellow developers.
It is also very essential for distributed development where the developers rarely meet face to face.
Besides, an earlier study suggests that without appropriate code reviewers, the whole change submission could be delayed by 12 days on average.
However, identifying such reviewers is challenging since the skill is not obvious, and it would require massive mining of the revision history.
Earlier studies analyzed the line change history of source code, the file path similarity of source files, and review comments.
In short, they mostly considered the work experience of a candidate code reviewer within a single project only.
However, some skills span across multiple projects such as working experience with specific API libraries or specialized technologies.
Also, in an industrial setting, a developer's contributions are scattered throughout different projects within the company codebase.
We thus consider the external libraries and APIs included in the changed code to suggest more appropriate code reviewers.
This is the outline of my today’s talk.
We collect commercial projects and libraries from Vendasta codebase, a medium sized Canadian software company.
Then we ask 3 research questions and conduct an exploratory study to answer those questions.
Based on those findings, we then propose our recommendation technique—CORRECT.
Then, in the experiments, we evaluated on commercial projects, compared with the state-of-the-art, and also evaluated on open source projects.
Finally, we conclude the talk.
We ask these three research questions.
In a commercial codebase, there are two types of projects: customer projects and utility projects. The utility projects are also called libraries.
We ask.
How frequently do the commercial software projects reuse external libraries in their code?
Does working experience on such libraries matter in code reviewer selection? That means does a reviewer with such experience get preference over the others?
Does working experience with specialized technologies such as mapreduce, taskqueue matter in code reviewer selection?
This is the connectivity graph of core projects and internal libraries from the Vendasta codebase.
We see the graph is pretty much connected, that means most of the libraries are used by most of the projects.
For the study, we chose 10 projects and 10 internal libraries, selected under certain restrictions.
Each project should have at least 750 closed pull requests, which means they are pretty big and, most importantly, quite active.
Each internal library should be used at least 10 times on average by each of those projects.
Each specialized technology should be used at least 5 times on average by each of those projects.
We consider the Google App Engine libraries as the specialized technologies. We consider 10 of them.
This is the usage frequency of the selected libraries in the 10 projects we selected for the study.
We take the latest snapshot of each project, analyze its source files, and look for imported libraries using an AST parser.
We try to find out those 10 libraries mostly, and this is the box plot of their frequencies.
We can see that vtest, vauth and vapi are the most used, which kind of makes sense, especially for vtest and vauth, since they likely provide generic testing and authentication support.
However, vtest has a large variance, that means some projects used it extensively whereas the others didn’t use it at all.
The least used libraries are vlogs and vmonitor.
So, these are the empirical frequencies.
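The import analysis described above can be approximated with Python's standard `ast` module. This is an illustrative sketch, not the authors' code; the library names in the example are from the study, but `imported_libraries` is a hypothetical helper.

```python
import ast

def imported_libraries(source):
    """Collect the top-level module names imported by a Python source file."""
    libs = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            libs.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            libs.add(node.module.split(".")[0])
    return libs

# Counting such imports across a project snapshot yields the usage frequencies.
code = "import vauth\nfrom vapi.client import Client\n"
print(sorted(imported_libraries(code)))  # ['vapi', 'vauth']
```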
We investigated the ratio of pull requests that used any of the selected libraries.
We note that 30%-70% of all pull requests did that in different projects.
We also investigated the percentage of the library authors who are later recommended as code reviewers for the projects referring to that library.
We considered a developer as library author if he/she authored at least one pull request of the library.
We note that almost 100% of the authors are later recommended.
This is a very interesting finding, suggesting that library experience really matters.
We also calculated the empirical frequency of the ten specialized technologies in the selected Vendasta projects
And this is the box plot.
We can see that mapreduce is the champion technology here, and the rest are close competitors.
In case of the pull requests, 20%-60% pull requests used at least one of ten specialized technologies
Mostly used by ARM, CS and VBC.
So, specialized technologies are also used in our selected projects quite significantly.
So, here are the empirical findings from the exploratory studies we conducted.
They suggest that library experience and specialized technology experience really matter.
These are new findings, and we exploit them to develop the recommendation algorithm later.
Based on those exploratory findings, we propose CORRECT: Code Reviewer Recommendation Based on Cross-Project and Technology Experience.
This is our recommendation algorithm.
Once a new pull request R3 is created, we analyze its commits, then its source files, and look for the libraries referred to and the specialized technologies used. Thus, we get a library token list and a technology token list.
We combine both lists, and the combined list can be considered a summary of the libraries and technologies of the new pull request.
Now, we consider the 10 most recent closed pull requests and collect their library and technology tokens.
It should be noted that the past requests contain their code reviewers.
Now, we estimate the similarity between the new request and each of the past requests, using the cosine similarity between their token lists.
We add that score to the corresponding code reviewers.
This way, we finally get a list of reviewers whose scores accumulate over the different past reviews.
Then they are ranked, and the top reviewers are recommended.
Thus, we use the pull request similarity score to estimate the relevant expertise of code reviewers.
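The scoring scheme just described can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: each past request record carries its library/technology `tokens` and its recorded `reviewers`, and similarity is the cosine between token frequency vectors; the sample data is hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two token frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recommend(past_requests, new_tokens, top_k=5):
    """Accumulate each reviewer's similarity scores over past requests and rank."""
    new_vec = Counter(new_tokens)
    scores = Counter()
    for pr in past_requests:
        sim = cosine(Counter(pr["tokens"]), new_vec)
        for reviewer in pr["reviewers"]:
            scores[reviewer] += sim        # add this request's score to its reviewers
    return [name for name, _ in scores.most_common(top_k)]

# Hypothetical past requests: alice reviewed vauth/mapreduce code, bob vlogs code.
past = [
    {"tokens": ["vauth", "mapreduce"], "reviewers": ["alice"]},
    {"tokens": ["vlogs"], "reviewers": ["bob"]},
]
print(recommend(past, ["vauth", "taskqueue"]))  # alice ranks first
```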
Now, to be technically specific
The state-of-the-art considers two pull requests relevant/similar if they share source code files or directories.
On the other hand, we suggest that two pull requests are relevant/similar if they share the same external libraries and specialized technologies.
That’s the major difference in methodology and our core technical contribution.
We performed two evaluations: one with the Vendasta codebase, and the other with an open source codebase.
From these experiments, we try to answer four research questions.
Are library experience and technology experience useful proxies for code review skills?
Can our technique outperform the state-of-the-art technique from the literature?
Does it perform equally for closed source and open source projects?
Does it show any bias to any particular platform?
We conducted experiments using 10 projects from the Vendasta codebase and 6 projects from the open source domain.
From Vendasta, we collected 13K pull requests, and from open source, we collected 4K pull requests.
Gold reviewers are collected from the corresponding pull requests.
The Vendasta projects are Python-based, whereas the OSS projects are written in Python, Java and Ruby.
We consider four performance metrics: accuracy, precision, recall, and reciprocal rank.
For accuracy, if the recommendation contains at least one gold reviewer, we consider the recommendation accurate.
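These per-request metrics can be made concrete with a small sketch. It illustrates the standard definitions, not the authors' evaluation code; `recommended` is the ranked recommendation list and `gold` the actual reviewers of the request.

```python
def metrics(recommended, gold, k=5):
    """Top-k hit, reciprocal rank, precision and recall for one pull request."""
    top = recommended[:k]
    hit = any(r in gold for r in top)          # Top-k accuracy counts this as a hit
    rr = 0.0
    for rank, r in enumerate(top, start=1):    # reciprocal rank of first gold reviewer
        if r in gold:
            rr = 1.0 / rank
            break
    tp = len(set(top) & set(gold))             # correctly recommended reviewers
    precision = tp / len(top) if top else 0.0
    recall = tp / len(gold) if gold else 0.0
    return hit, rr, precision, recall

hit, rr, p, r = metrics(["alice", "bob", "carol"], {"bob", "dave"}, k=3)
print(hit, rr, round(p, 2), round(r, 2))  # True 0.5 0.33 0.5
```

Averaging these values over all evaluated pull requests gives Top-K Accuracy, MRR, MP and MR, respectively.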
This is how we answer the first RQ.
We see that both library similarity and technology similarity are pretty good proxies for code review skills.
Each of them provides over 90% top-5 accuracy.
However, when we combine them, we get the maximum: 92% top-5 accuracy.
The precision and recall are also greater than 80% which is highly promising according to relevant literature.
We then compare with the state-of-the-art, RevFinder.
We found that our performance is significantly better: we get a p-value of 0.003 for top-5 accuracy with the Mann-Whitney U test.
The median accuracy is 95%, and the median precision and median recall are between 85% and 90%.
On individual projects, our technique also outperformed the state-of-the-art.
We also experimented using 6 open source projects and found 85% top-5 accuracy.
Precision and recall are not significantly different from those with the Vendasta projects;
for example, with precision, we get a p-value of 0.239, which is greater than 0.05.
This slide shows how CORRECT performed with projects from 3 programming platforms: Python, Java and Ruby.
We also find quite similar performance for each of the platforms which is interesting.
This shows that our findings with commercial projects are quite generalizable.
Now to summarize
Code review could be unpleasant or unproductive without appropriate code reviewers.
We first motivated our technique using an exploratory study, which suggested that
library experience and specialized technology experience really matter for code reviewer selection.
Then we proposed our technique, CORRECT, which learns from past review history and then recommends reviewers.
We experimented using both commercial and open source projects, and compared with the state-of-the-art.
The results clearly demonstrate the high potential of our technique.
That’s all I have to say today.
Thanks for your time. Questions!!
There are a few threats to the validity of our findings.
-- The dataset from the VA codebase is a bit skewed: most of the projects are medium sized, and only one project is big.
-- Also, the number of projects considered from the open source domain is limited.
-- Also, the technique could be slower for big pull requests.