
It’s also not in the nature of GO-JEK’s engineering team to manually do things that can easily be automated :)
GO-JEK is a start-up in Indonesia that offers many services: its Android and iOS applications include more than 20 of them. Every one of these services feeds a large amount of data into the GO-JEK system. However, this data is not well structured or integrated; in other words, it is still raw. GO-JEK’s data infrastructure enables the secure and reliable publishing and consumption of both raw and aggregate data. This opens up several possibilities at GO-JEK, from AI-based allocation, fraud detection, and recommendations to critical real-time business reporting and monitoring. Given the real-time nature of GO-JEK’s business, the entire data infrastructure is built from the ground up around real-time data processing, unlike traditional batch processing architectures.
Raw data (customers, orders, drivers, etc.) enters the GO-JEK system through two main channels, namely Source Code Instrumentation and Machine Statistics Daemon. The Stream Producer gives publishers the power to evolve their data schemas without breaking any consumers. Data is encoded with Protocol Buffers, a language-neutral, platform-neutral, extensible mechanism for serializing structured data. Stream producers publish the encoded data to a fronting server, a setup that is well suited to high traffic. Each team is allocated one or more highly available fronting servers to avoid a single point of failure and data loss.
We had to build an algorithm that can not only correctly identify separate clusters from each other, but also allocate a central point for each cluster that best represented where pickups frequently happen
GOJEK’S DATA ENGINEER
GO-JEK uses a clustering model to process the data. One example involves POIs, or Points of Interest: places where a large number of pickups occur. These can be popular locations such as malls, train stations, schools, universities, or large housing estates. In the previous version of the GO-JEK application, when users ordered GO-RIDE or GO-CAR at a POI, they had to coordinate with the driver to confirm the exact pickup point. GO-JEK wants to reduce this friction by letting users choose the ‘gate’ or entrance they want. The allocated driver then goes straight to that point without any need to coordinate through calls or chat messages. But how does GO-JEK know where these gates are, and which pickup points are convenient or popular?
One way to do this is to identify popular pickup points manually. But this is not a reliable approach, given the large number of POIs that GO-JEK wants to find and record. Instead, GO-JEK uses clustering algorithms, such as DBSCAN and KMeans, to do this automatically.
GO-JEK’s data engineers and data scientists therefore have to automate this process: that is, build an algorithm that can identify clusters of popular pickup points as correctly as the human brain intuitively does when looking at a plot.
To do this, the team takes advantage of the historical data it has about exactly where customers were picked up around each POI. Below is an example of the locations where customers are picked up around Blok M Square, a mall in Jakarta.

From the picture, it is quite clear to the human eye that there are five areas around the mall where pickups occur most often. But how does GO-JEK find them automatically? By using clustering algorithms such as DBSCAN and KMeans.
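To make the density-based option concrete, here is a minimal, standard-library-only sketch of DBSCAN on made-up pickup coordinates. The gate positions, the `eps` radius, and the `min_samples` threshold are all illustrative assumptions, not GO-JEK's data or settings, and a production system would more likely rely on a library implementation.

```python
import math

def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reachable from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_samples:
                queue.extend(reach)    # j is also a core point, keep expanding
    return labels

# Hypothetical data: a 3x3 patch of pickups (0.5 m spacing) around each of
# three made-up gates, plus one stray pickup far from any gate.
gates = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
offsets = (-0.5, 0.0, 0.5)
pickups = [(gx + dx, gy + dy) for gx, gy in gates for dx in offsets for dy in offsets]
pickups.append((5.0, 5.0))

labels = dbscan(pickups, eps=0.8, min_samples=4)
n_clusters = len({l for l in labels if l != -1})
n_noise = labels.count(-1)
print(f"{n_clusters} clusters, {n_noise} noise point(s)")
```

Note that DBSCAN discovers the number of clusters on its own and flags stray pickups as noise, but it does not return a central point per cluster; that would have to be computed separately from the members of each cluster.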
GO-JEK, however, uses KMeans, because it is more accurate than DBSCAN at finding the number of clusters that represents the pickup dataset for a POI.
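As an illustration of how KMeans yields one central point per cluster, the following sketch implements Lloyd's algorithm with only the standard library. The synthetic pickups, the three gate positions, and the farthest-point seeding (chosen here to keep the example deterministic) are assumptions made for this example; the source does not describe GO-JEK's actual implementation.

```python
import math
import random

def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm on 2-D points; returns (centers, labels)."""
    centers = farthest_point_init(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each pickup joins its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its pickups.
        updated = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                updated.append((sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members)))
            else:
                updated.append(centers[c])  # empty cluster: leave its center alone
        if updated == centers:
            break  # converged: no center moved
        centers = updated
    return centers, labels

# Hypothetical data: 30 noisy pickups (local x/y offsets in meters) around
# each of three made-up gates.
rng = random.Random(42)
gates = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
pickups = [(rng.gauss(gx, 0.5), rng.gauss(gy, 0.5)) for gx, gy in gates for _ in range(30)]

centers, labels = kmeans(pickups, k=3)
for cx, cy in centers:
    print(f"proposed gate at ({cx:.2f}, {cy:.2f})")
```

Each returned center is the mean of the pickups assigned to it, which is what makes it a natural candidate for a gate. The catch is that `k` has to be passed in.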
To measure the accuracy of its model and methodology, GO-JEK relies on quantitative measures such as statistical calculations, for example when applying the KMeans algorithm in the clustering model.

However, KMeans has one major disadvantage: k, the number of clusters, must be supplied as an input to the model, so GO-JEK has to automate the process of determining the k value for each POI accurately. It does this through quantitative methods: the Elbow Method to determine the optimal value of k; Silhouette Analysis to choose the value of k that maximizes the average silhouette score; and the Stupid Backoff method to ensure that the chosen number of clusters is truly optimal, based on the significance of each cluster with respect to all pickups associated with the POI. After finding the number of clusters for a POI, GO-JEK establishes each cluster center (as determined by KMeans) as a pickup point, or gate. But the work does not end there. The next step is to find the names of these gates, which also has to be automated and brings its own series of challenges; the methods that follow aim to make the naming accurate enough to be implemented properly.
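The silhouette step can be sketched as follows: run KMeans for several candidate values of k and keep the one with the highest mean silhouette score. This is a minimal, standard-library-only illustration on synthetic pickups; the data, the candidate range of 2 to 6, and the deterministic farthest-point seeding are assumptions for the example, and the Elbow Method and Stupid Backoff checks mentioned above are not shown.

```python
import math
import random

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with farthest-point seeding; returns cluster labels."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return labels

def mean_silhouette(points, labels):
    """Average silhouette coefficient over all points (higher is better)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical data: 30 noisy pickups around each of three made-up gates.
rng = random.Random(42)
gates = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
pickups = [(rng.gauss(gx, 0.5), rng.gauss(gy, 0.5)) for gx, gy in gates for _ in range(30)]

# Score every candidate k and keep the one with the best mean silhouette.
scores = {k: mean_silhouette(pickups, kmeans(pickups, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: mean silhouette {s:.3f}")
print("chosen k:", best_k)
```

Because the silhouette score rewards clusters that are both tight and well separated, splitting a genuine pickup area in two or merging two areas into one both lower the average, which is why the score peaks at the number of well-separated groups in the data.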
