Aroma: Facebook's open source code recommendation tool, based on machine learning

Thousands of engineers write the code for our apps, which serve billions of people around the world. This is no trivial task: our services have become so diverse and complex that the codebase contains millions of lines of code that intersect with a wide variety of systems, from messaging to image rendering. To simplify and speed up the process of writing code that will affect so many systems, engineers often need a way to find out how someone else has handled a similar task. We created Aroma, a code search and recommendation tool that uses machine learning (ML) to make the process of gaining insights from big codebases much easier.

Prior to Aroma, none of the existing tools fully addressed this problem. Documentation tools are not always available and can be out of date, code search tools often return myriad matching results, and it is difficult to immediately find idiomatic usage patterns. With Aroma, engineers can easily find common coding patterns without the need to manually go through dozens of code snippets, saving time and energy in their day-to-day development workflow.

In addition to deploying Aroma to our internal codebase, we also created a version of Aroma on open source projects. All examples in this post are taken from a collection of 5,000 open source Android projects on GitHub.

What is code recommendation and when do you need it?

Let's consider the case of an Android engineer who wants to see how others have written similar code. Suppose the engineer writes the following to decode a bitmap on an Android phone:

Bitmap bitmap = BitmapFactory.decodeStream(input);

This works, but the engineer wants to know how others have implemented this functionality in related projects, especially what common options are set, or what common errors are handled, to avoid crashing the app in production.

Aroma enables engineers to make a search query with the code snippet itself. The results are returned as code recommendations. Each code recommendation is created from a cluster of similar code snippets found in the repository and represents a common usage pattern. Here is the first recommendation returned by Aroma for this example:

Code Sample 1

final BitmapFactory.Options options = new BitmapFactory.Options();
options.inSampleSize = 2;
// ...
Bitmap bmp = BitmapFactory.decodeStream(is, null, options);

This code recommendation is synthesized from a cluster of five similar methods found in the repository. Only the code common to the methods in the cluster is shown here, with the details specific to individual methods removed in the gap (the // ... line).

What is this code recommendation trying to say? Well, it says that in five different cases, engineers set additional options when decoding the bitmap. Setting the sample size helps reduce the memory consumption when decoding large bitmaps, and, indeed, a popular post on Stack Overflow suggests the same pattern. Aroma created this recommendation automatically by discovering a cluster of code snippets that all contain this pattern.

Let's consider another recommendation.

Code Sample 2

try {
  InputStream is = am.open(fileName);
  image = BitmapFactory.decodeStream(is);
  is.close();
} catch (IOException e) {
  // ...
}

This code snippet is synthesized from a separate cluster of four methods. It shows a customary usage of InputStream in decoding bitmaps. Furthermore, this recommendation demonstrates the good practice of catching the potential IOException when opening the InputStream. If this exception occurs at runtime and is not caught, the app will crash immediately. A responsible engineer should extend the code using this recommendation and handle this exception properly.

Aroma code recommendations integrated in the coding environment.

Compared with traditional code search tools, Aroma's code recommendation capability has several advantages:

Aroma performs search on syntax trees. Rather than looking for string-level or token-level matches, Aroma can find instances that are syntactically similar to the query code and highlight the matching code by pruning unrelated syntax structures.
Aroma automatically clusters together similar search results to generate code recommendations. These recommendations represent idiomatic coding patterns and are easier to consume than unclustered search matches.
Aroma is fast enough to use in real time. In practice, it creates recommendations within seconds even for very large codebases and does not require pattern mining ahead of time.
Aroma's core algorithm is language-agnostic. We have deployed Aroma across our internal codebases in Hack, JavaScript, Python, and Java.

How does Aroma work?

Aroma creates code recommendations in three main stages:

1) Feature-based search

First, Aroma indexes the code corpus as a sparse matrix. It does this by parsing each method in the corpus and creating its parse tree. Then it extracts a set of structural features from the parse tree of each method. These features are carefully chosen to capture information about variable usage, method calls, and control structures. Finally, it creates a sparse vector for each method according to its features. The feature vectors for all method bodies become the indexing matrix, which is used for the search retrieval.
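To make the indexing step concrete, here is a minimal Java sketch. It is an illustration under stated assumptions, not Aroma's implementation: the toy Node type, the class name FeatureIndexSketch, and the two feature kinds used here (a node's label, and a parent-to-child label pair) are all hypothetical stand-ins for the structural features described above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: turns toy parse trees into rows of a sparse binary matrix.
class FeatureIndexSketch {

    // Toy parse-tree node: a label plus child nodes (an assumption for illustration).
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label, Node... kids) {
            this.label = label;
            for (Node k : kids) children.add(k);
        }
    }

    // Vocabulary: feature string -> column index in the matrix.
    private final Map<String, Integer> vocabulary = new LinkedHashMap<>();
    // One sparse row per method: the sorted column indices whose value is 1.
    private final List<TreeSet<Integer>> rows = new ArrayList<>();

    // Collects two hypothetical feature kinds: node labels and parent>child pairs.
    static void collect(Node node, List<String> out) {
        out.add("label:" + node.label);
        for (Node child : node.children) {
            out.add("edge:" + node.label + ">" + child.label);
            collect(child, out);
        }
    }

    // Indexes one method and returns its row number.
    int addMethod(Node root) {
        List<String> features = new ArrayList<>();
        collect(root, features);
        TreeSet<Integer> row = new TreeSet<>();
        for (String f : features) {
            Integer col = vocabulary.get(f);
            if (col == null) {            // first time this feature is seen
                col = vocabulary.size();
                vocabulary.put(f, col);
            }
            row.add(col);
        }
        rows.add(row);
        return rows.size() - 1;
    }

    TreeSet<Integer> row(int i) { return rows.get(i); }
    int vocabularySize() { return vocabulary.size(); }

    public static void main(String[] args) {
        FeatureIndexSketch index = new FeatureIndexSketch();
        int r = index.addMethod(new Node("call:decodeStream", new Node("arg:input")));
        System.out.println("row " + r + " -> columns " + index.row(r));
    }
}
```

Storing only the nonzero column indices keeps each row small even when the vocabulary grows large, which is the property that sparse indexing relies on.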

When an engineer writes a new code snippet, Aroma creates a sparse vector in the manner described above and takes the dot product of this vector with the matrix containing the feature vectors of all existing methods. The top 1,000 method bodies whose dot products are highest are retrieved as the candidate set for recommendation. Even though the code corpus could contain millions of methods, this retrieval is fast due to efficient implementations of dot products of sparse vectors and matrices.
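The retrieval step can be sketched in a few lines of Java. This is a simplified illustration, not Aroma's code: the class name SparseFeatureSearch, the feature strings, and the map-based vector representation are assumptions, and a production system would use an optimized sparse matrix library rather than hash maps.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: each method body is a sparse vector stored as
// a map from feature name to weight; absent features have weight 0.
class SparseFeatureSearch {

    // Dot product of two sparse vectors; iterates over the smaller map.
    static double dot(Map<String, Double> a, Map<String, Double> b) {
        if (a.size() > b.size()) { Map<String, Double> t = a; a = b; b = t; }
        double sum = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) sum += e.getValue() * w;
        }
        return sum;
    }

    // Returns the indices of the k corpus vectors with the highest dot
    // product against the query, best first (ties keep corpus order).
    static List<Integer> topK(Map<String, Double> query,
                              List<Map<String, Double>> corpus, int k) {
        double[] scores = new double[corpus.size()];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < corpus.size(); i++) {
            scores[i] = dot(query, corpus.get(i));
            order.add(i);
        }
        order.sort(Comparator.comparingDouble((Integer i) -> scores[i]).reversed());
        return new ArrayList<>(order.subList(0, Math.min(k, order.size())));
    }

    // Builds a binary feature vector from feature names (illustrative helper).
    static Map<String, Double> features(String... names) {
        Map<String, Double> v = new HashMap<>();
        for (String n : names) v.put(n, 1.0);
        return v;
    }

    public static void main(String[] args) {
        List<Map<String, Double>> corpus = new ArrayList<>();
        corpus.add(features("call:open"));
        corpus.add(features("call:decodeStream", "type:Bitmap", "field:inSampleSize"));
        corpus.add(features("call:decodeStream"));
        Map<String, Double> query = features("call:decodeStream", "type:Bitmap");
        System.out.println(topK(query, corpus, 2));
    }
}
```

In the setting described above, k would be 1,000 and the corpus would hold one vector per indexed method body.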

2) Reranking and clustering

After Aroma retrieves the candidate set of similar-looking methods, the next phase is to cluster them. In order to do this, Aroma first needs to rerank the candidate methods by their similarity to the query code snippet. Because the sparse vectors contain only abstract information about what features are present, the dot product score is an underestimate of the actual similarity of a code snippet to the query. Therefore, Aroma applies pruning on the method syntax trees to discard the irrelevant parts of a method body and retain only the parts that best match the query snippet, in order to rerank the candidate code snippets by their actual similarities to the query.
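The following Java sketch shows the general idea of pruning before scoring. It is only an approximation under stated assumptions: it works on per-statement feature sets rather than syntax trees, it prunes by dropping statements that share no feature with the query, and it uses Jaccard overlap as a stand-in similarity measure. The class name RerankSketch and the feature strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of rerank-after-prune on statement-level feature sets.
class RerankSketch {

    // A candidate method: every statement is reduced to a set of feature names.
    static final class Candidate {
        final String name;
        final List<Set<String>> statements;
        Candidate(String name, List<Set<String>> statements) {
            this.name = name;
            this.statements = statements;
        }
    }

    // Pruning stand-in: keep statements that share a feature with the query
    // and return the union of their features.
    static Set<String> prunedFeatures(Set<String> query, Candidate c) {
        Set<String> kept = new HashSet<>();
        for (Set<String> stmt : c.statements) {
            if (!Collections.disjoint(stmt, query)) kept.addAll(stmt);
        }
        return kept;
    }

    // Jaccard overlap, used here as an assumed similarity measure.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) inter.size() / union.size();
    }

    // Returns candidate names ordered by similarity of their pruned form to the query.
    static List<String> rerank(Set<String> query, List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble(
                (Candidate c) -> jaccard(query, prunedFeatures(query, c))).reversed());
        List<String> names = new ArrayList<>();
        for (Candidate c : sorted) names.add(c.name);
        return names;
    }

    static Set<String> set(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    public static void main(String[] args) {
        Set<String> query = set("decodeStream", "Bitmap");
        List<Set<String>> shortBody = Arrays.asList(set("open"), set("decodeStream"));
        List<Set<String>> longBody = Arrays.asList(
                set("decodeStream", "Bitmap", "options"), set("setImageBitmap"), set("log"));
        Candidate a = new Candidate("A", shortBody);
        Candidate b = new Candidate("B", longBody);
        System.out.println(rerank(query, Arrays.asList(a, b)));
    }
}
```

Without the pruning step, the unrelated statements in a long method body would dilute its score even when it contains a close match to the query.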

After obtaining a list of candidate code snippets in descending order of similarity to the query, Aroma runs an iterative clustering algorithm to find clusters of code snippets that are similar to each other and contain extra statements useful for creating code recommendations.

3) Intersecting: The process of creating code recommendations

Code snippet 1 (adapted from this project):

InputStream is = ...;
final BitmapFactory.Options options = new BitmapFactory.Options();
options.inSampleSize = 2;
Bitmap bmp = BitmapFactory.decodeStream(is, null, options);
ImageView imageView = ...;
imageView.setImageBitmap(bmp);
// some more code

Code snippet 2 (adapted from this project):

BitmapFactory.Options options = new BitmapFactory.Options();
while (...) {
  in = ...;
  options.inSampleSize = 2;
  options.inJustDecodeBounds = false;
  bitmap = BitmapFactory.decodeStream(in, null, options);
}

Code snippet 3 (adapted from this project):

BitmapFactory.Options bmpFactoryOptions = new BitmapFactory.Options();
// some setup code
try {
  bmpFactoryOptions.inSampleSize = 2;
  loadedBitmap = BitmapFactory.decodeStream(inputStream, null, bmpFactoryOptions);
  // some code...
} catch (OutOfMemoryError oom) {
}

The intersecting algorithm works by taking the first code snippet as the "base" code and then iteratively applying pruning on it with respect to every other method in the cluster. The code that remains after the pruning process is the code common to all the methods, and it becomes the code recommendation. For more details, see our paper on the topic.

In this example, each code snippet contains code that is specific to its own project, but they all contain the same code that sets up the options for decoding the bitmap. As described, Aroma finds the common code by first pruning the lines in the first code snippet that do not appear in the second snippet. The intermediate result would look like this:

InputStream is = ...;
final BitmapFactory.Options options = new BitmapFactory.Options();
options.inSampleSize = 2;
Bitmap bmp = BitmapFactory.decodeStream(is, null, options);

The code in code snippet 1 about ImageView does not appear in code snippet 2 and is therefore removed. Now Aroma takes this intermediate snippet and prunes the lines that do not appear in code snippet 3, code snippet 4, and so on. The resulting code is returned as a code recommendation. As shown in Code Sample 1, the code recommendation created from this cluster contains exactly the three lines of code that are common among all method bodies.
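The pruning loop just described can be sketched at the statement level in Java. This is a deliberately simplified illustration: Aroma prunes syntax trees, whereas here each statement carries a hand-assigned key that stands in for "the same statement up to variable names". The class name IntersectSketch and all key strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Statement-level sketch of intersecting a cluster of snippets.
class IntersectSketch {

    static final class Stmt {
        final String key;   // hypothetical normalized form that ignores variable names
        final String text;  // the source line as written in the base snippet
        Stmt(String key, String text) { this.key = key; this.text = text; }
    }

    // Keeps only the base statements whose key also occurs in the other snippet.
    static List<Stmt> prune(List<Stmt> base, Set<String> otherKeys) {
        List<Stmt> kept = new ArrayList<>();
        for (Stmt s : base) {
            if (otherKeys.contains(s.key)) kept.add(s);
        }
        return kept;
    }

    // Uses the first snippet as the base and prunes it against every other snippet.
    static List<Stmt> intersect(List<Stmt> base, List<Set<String>> others) {
        List<Stmt> common = base;
        for (Set<String> keys : others) common = prune(common, keys);
        return common;
    }

    // The three snippets from the example, with hand-assigned keys.
    static List<String> example() {
        List<Stmt> snippet1 = Arrays.asList(
            new Stmt("stream", "InputStream is = ...;"),
            new Stmt("new-options",
                "final BitmapFactory.Options options = new BitmapFactory.Options();"),
            new Stmt("sample-size", "options.inSampleSize = 2;"),
            new Stmt("decode", "Bitmap bmp = BitmapFactory.decodeStream(is, null, options);"),
            new Stmt("image-view", "ImageView imageView = ...;"),
            new Stmt("set-bitmap", "imageView.setImageBitmap(bmp);"));
        Set<String> snippet2 = new HashSet<>(Arrays.asList(
            "new-options", "stream", "sample-size", "decode-bounds", "decode"));
        Set<String> snippet3 = new HashSet<>(Arrays.asList(
            "new-options", "sample-size", "decode"));
        List<String> lines = new ArrayList<>();
        for (Stmt s : intersect(snippet1, Arrays.asList(snippet2, snippet3))) {
            lines.add(s.text);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : example()) System.out.println(line);
    }
}
```

Running main prints the options setup, the inSampleSize assignment, and the decodeStream call, in the wording of the first snippet. Because the base snippet supplies the text, the recommendation keeps the variable names of whichever snippet comes first in the cluster.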

Other code recommendations are created from other clusters in the same way, and Aroma's algorithm ensures that these recommendations are substantially different from one another, so engineers can learn a wide range of coding patterns by looking at just a few code snippets. For example, Code Sample 2 is a recommendation computed from a different cluster.

This is the true advantage of using Aroma. Rather than leaving engineers to go through dozens of code search results and work out idiomatic usage patterns by hand, Aroma does this automatically, in just a few seconds!

Broader picture

Given the vast amount of code that already exists, we believe engineers should be able to easily discover recurring coding patterns in a big codebase and learn from them. This is exactly the ability that Aroma facilitates. Aroma and Getafix are just two of several big code projects we are working on that leverage ML to improve software engineering. With the advances in this area, we believe that programming should become a semiautomated task in which humans express higher-level ideas and detailed implementation is done by the computers themselves.

We'd like to thank Koushik Sen and Di Yang for their work on this project.

Written by

Celeste Barnaby

Software Engineer, Facebook

Satish Chandra

Software Engineering Manager, Facebook

Frank Luan

Software Engineer, Facebook