
HOW DOES GOJEK’S BIG DATA WORK

“If we can guess, you don’t need a lot of clicks. The fewer clicks, the more fun and the more addictive (the Go-Jek app) becomes,” said Nadiem, CEO of Go-Jek.

As one of Indonesia’s unicorn startups, Go-Jek needs to keep innovating so that the products and businesses it develops continue to grow. Go-Jek now offers 24 services, with GoRide, GoCar, and GoFood among the most frequently used, and every one of these services feeds customer and driver data into the Go-Jek system. To sustain the business, Go-Jek therefore innovates to make the company more effective and efficient, and one way it does so is through big data. Companies that manage to process their big data well gain an advantage: it helps them formulate strategy, especially when making business decisions.

Big data itself is a collection of raw data that can be analyzed computationally to reveal patterns and trends, especially those relating to human behavior and interaction. From this big data, Go-Jek can observe the behavior of its users, in this case consumers and driver-partners, which is then processed and analyzed into useful information for the organization. The intent is that, in the future, all Go-Jek employees can communicate based on data, not assumptions.

At Go-Jek, this task falls to the business intelligence division, which has 46 members and also includes a growth team. The role of the data scientist is vital as well. In terms of data processing, business intelligence and data science do the same work: data visualization and mathematical and statistical calculations. The difference between the two lies in how they look at, and decide based on, the data.

The basic difference is the focus of the analysis itself: business intelligence is a function that analyzes retrospectively, while data science is more predictive.

Muhammad Adrian, Head of Driver Growth and Performance (DRONE)

In GoRide or GoCar, the pickup location column is often pre-filled with an address, while the destination column shows three suggested addresses, one of which is usually the customer’s actual destination. Here, business intelligence must collect and analyze data on passengers’ ordering activity and the addresses they travel to most often; these are unified and integrated through a database program, and suggestions or decisions are made based on that data. After that, the data scientist predicts how to save the customer from having to enter an address manually: by linking quantitative data to complex business questions and generating measurable solutions with various analytical tools and quantitative techniques, real-time decisions can be made using technologies such as machine learning, artificial intelligence (AI), and natural language processing. In the end, the suggested pickup and destination are usually right, though sometimes mistaken. When they are right, the customer does not need to bother entering them again manually; it is enough to press the order button.
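
Go-Jek has not published these models; as a minimal illustration of the simplest version of the idea, the Python sketch below suggests destinations purely by frequency in a customer’s order history. All names and data are hypothetical.

```python
from collections import Counter

def suggest_destinations(order_history, top_n=3):
    """Suggest the most likely destinations from past orders (hypothetical data)."""
    counts = Counter(order_history)
    return [address for address, _ in counts.most_common(top_n)]

# Hypothetical order history for one customer
history = ["Office, Jl. Sudirman", "Home, Kemang", "Office, Jl. Sudirman",
           "Gym, Senayan", "Office, Jl. Sudirman", "Home, Kemang"]

print(suggest_destinations(history))
# ['Office, Jl. Sudirman', 'Home, Kemang', 'Gym, Senayan']
```

A production system would of course weight recency, time of day, and location, but the frequency baseline captures why three suggested addresses are often enough.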

Another thing Go-Jek can do with big data on consumer behavior is manage who gets which orders. The CEO realizes that the drivers who are Go-Jek’s partners often pick and choose orders on their own: some avoid orders from certain restaurants at certain hours, and some frequently cancel orders heading in a certain direction. Big data can be relied on to route drivers orders they will not cancel. For example, if Driver G never wants to take orders from a restaurant far from home, then going forward that driver will not be offered another GoFood order from that restaurant. Besides not disappointing consumers, this also means Go-Jek does not need to keep subsidizing drivers to motivate them to take orders they are reluctant about, which is more efficient for Go-Jek’s finances.
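
Go-Jek has not published its allocation logic; purely to illustrate the rule just described, here is a minimal Python sketch in which the cancellation history, driver IDs, and threshold are all hypothetical.

```python
# Hypothetical cancellation history: (driver_id, restaurant) -> cancel rate
cancel_rate = {
    ("driver_G", "Restaurant X"): 0.9,
    ("driver_H", "Restaurant X"): 0.1,
}

def eligible_drivers(drivers, restaurant, threshold=0.5):
    """Keep only drivers unlikely to cancel an order from this restaurant."""
    return [d for d in drivers
            if cancel_rate.get((d, restaurant), 0.0) < threshold]

print(eligible_drivers(["driver_G", "driver_H"], "Restaurant X"))
# ['driver_H']  -> driver_G is no longer offered this restaurant's orders
```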

Go-Jek’s way of processing big data also plays a significant role in the sustainability of the business itself, reflected in its valuation of USD 9.5 billion, or roughly Rp 134 trillion. This is one of Go-Jek’s efforts to keep innovating, and it should be emulated by other technology startups in Indonesia to produce growth in the middle of the data-centric era.

SENTIMENT ANALYSIS ON #HARIGURUNASIONAL2019 USING ORANGE


Orange is a component-based visual programming software package for data visualization, machine learning, data mining, and data analysis. Orange components are called widgets; they range from simple data visualization, subset selection, and preprocessing to empirical evaluation of learning algorithms and predictive modeling. Visual programming is implemented through an interface in which workflows are created by linking predefined or user-designed widgets.

Orange can also be used as a tool for text mining. Text mining is the process of extracting useful information from text data on a platform. On this occasion, I want to do sentiment analysis on the Twitter platform using Orange. Sentiment analysis is used to find out users’ reactions to something. I will therefore analyze the sentiment, or reaction, of Twitter accounts toward #HariGuruNasional2019 using Orange.

The picture above shows the Orange workflow for text mining with the sentiment analysis method.

Next, click on the Twitter widget and enter #HariGuruNasional2019; here I set the maximum number of tweets to analyze to 100. A Twitter API key must also be supplied to enable the text mining process; it can be obtained from the Twitter Developer website.
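
Orange’s Twitter widget does this collection for you; for reference, roughly the same crawl can be sketched in Python with Tweepy against Twitter API v2. The bearer token is a placeholder obtained from the Twitter Developer website.

```python
import tweepy

# Placeholder credential from the Twitter Developer website
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Fetch up to 100 recent tweets mentioning the hashtag
response = client.search_recent_tweets(query="#HariGuruNasional2019",
                                       max_results=100)
tweets = [tweet.text for tweet in response.data]
print(len(tweets), "tweets collected")
```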

After clicking Search, we can open the Corpus Viewer to see the list of 100 tweets that discuss or allude to #HariGuruNasional2019, as crawled from Twitter.

Then click Topic Modelling to identify the topics Twitter users are talking about around #HariGuruNasional2019. Here, the widget is set to 10 topics.
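
Outside Orange, the same idea can be sketched with scikit-learn’s LDA implementation, again with 10 topics; the three short tweets below are stand-ins for the crawled corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = ["selamat hari guru nasional", "terima kasih guru",
          "guru pahlawan tanpa tanda jasa"]  # stand-ins for the real corpus

# Bag-of-words representation of the tweets
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets)

# Fit an LDA model with 10 topics, as in the Orange workflow
lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(X)

# Print the top words per topic
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-3:]]
    print(f"Topic {i}: {top}")
```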

Click Word Cloud to identify the words Twitter users used about #HariGuruNasional2019. There are 801 distinct words collected from the 100 tweets, and the most frequent is “guru” (teacher), which appears 137 times.
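
The frequency count behind the word cloud can be reproduced in a few lines of Python; the tweets here are again stand-ins for the real corpus.

```python
from collections import Counter

tweets = ["selamat hari guru nasional", "terima kasih guru"]  # stand-ins

# Tokenize on whitespace and count word frequencies across all tweets
words = [w.lower() for t in tweets for w in t.split()]
freq = Counter(words)

print(len(freq), "distinct words")
print(freq.most_common(5))  # 'guru' appears at the top, as in the word cloud
```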

After that, click Sentiment and Tweet Profiler on the Orange canvas; once the process is complete, the chart above appears with the sentiment profile for #HariGuruNasional2019. Of the accounts overall, about 20% express anger, 2.4% joy, 2.7% sadness, and 12.5% surprise, with the remainder falling into the other emotion categories, such as disgust.
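
Orange’s Tweet Profiler sends tweets to a remote emotion-classification model, for which there is no simple local equivalent. As a polarity-only stand-in (a different, simpler technique), NLTK’s VADER analyzer can score each tweet; note that VADER is English-only, so the examples below are in English.

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the sentiment lexicon
analyzer = SentimentIntensityAnalyzer()

tweets = ["Happy National Teachers' Day!", "I miss my teacher so much"]
for t in tweets:
    # compound ranges from -1 (most negative) to +1 (most positive)
    print(t, "->", analyzer.polarity_scores(t)["compound"])
```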

HOW DOES SNA WORK?

Social Network Analysis (SNA) is a tool for mapping and visualizing relationships between individuals. SNA was developed to understand the relations (ties/edges) between the actors (nodes/points) in a system, with two focal points: the actors and the relationships between them in a particular social context. This focus helps in understanding how an actor’s position can affect its access to resources such as goods, capital, and information. Information is one of the essential resources flowing in a network, so SNA is often used to identify the flow of information.

One step that must be done before analyzing a network is crawling. Crawling aims to retrieve and gather information from a number of connections so it can be analyzed further in the form of social network analysis. Here, I will crawl the follower and following lists of my friend Haydar’s Twitter account to map the ego network of Haydar’s social network.

The first thing to do in building the SNA is crawling with RStudio, and this is the script to crawl (collect) the follower and following data of @haydars_alfathi.
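
The crawling script itself is in R; for readers without R, a rough Python equivalent using Tweepy (Twitter API v2) might look like the following, with a placeholder bearer token.

```python
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # placeholder

# Resolve the account, then pull its follower and following lists
user = client.get_user(username="haydars_alfathi").data
followers = client.get_users_followers(id=user.id, max_results=1000).data
following = client.get_users_following(id=user.id, max_results=1000).data

print(len(followers), "followers,", len(following), "following")
```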

After that, we get a raw CSV file, which is then simplified into edges and nodes files for further analysis and visualization in the Gephi application.
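
What “simplified into edges and nodes files” means in practice is producing the edge list (Source, Target) and node list (Id, Label) that Gephi expects. A minimal pandas sketch, with stand-in account names:

```python
import pandas as pd

ego = "haydars_alfathi"
followers = ["alice", "bob"]      # stand-ins for the crawled lists
following = ["bob", "charlie"]

# Followers point to the ego; the ego points to accounts it follows
edges = pd.DataFrame(
    [(f, ego) for f in followers] + [(ego, f) for f in following],
    columns=["Source", "Target"])
nodes = pd.DataFrame({"Id": sorted({ego, *followers, *following})})
nodes["Label"] = nodes["Id"]

edges.to_csv("edges.csv", index=False)
nodes.to_csv("nodes.csv", index=False)
```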

Once the edges and nodes tables are created, they are analyzed and visualized with Gephi, one of the software packages most often used for Social Network Analysis. Here I analyze the ego network of the @haydars_alfathi account to find out the account’s relationships within its social network.

This is the visualization of the SNA ego network of the @haydars_alfathi account, from which we can analyze the standard SNA attributes: betweenness, closeness, degree, and modularity. Based on these data, I use a modularity resolution of 0.49, which forms 4 groups; the modularity measurement can also be read as indicating the number of groups that will be formed, using the clustering coefficient. The highest betweenness belongs to the @haydars_alfathi account, which means it has high potential to influence the outcome of the network (the quality of collaboration between its parts). The highest closeness is also on the @haydars_alfathi account, which indicates the speed with which it can spread information. The greatest degree, both in-degree and out-degree, is on the @haydars_alfathi account as well; the higher a node’s degree value, the more acquaintances that node represents, so it can be called a key player.
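
Gephi computes these statistics from its Statistics panel; the same measures can be checked in Python with NetworkX, as in this sketch on a toy ego network. Gephi’s modularity uses the Louvain method; NetworkX’s greedy modularity maximization is used below as a stand-in.

```python
import networkx as nx
from networkx.algorithms import community

# Toy ego network standing in for the crawled data
G = nx.Graph()
G.add_edges_from([("haydars_alfathi", "alice"), ("haydars_alfathi", "bob"),
                  ("haydars_alfathi", "charlie"), ("alice", "bob")])

print("betweenness:", nx.betweenness_centrality(G))
print("closeness:  ", nx.closeness_centrality(G))
print("degree:     ", dict(G.degree()))

# Community detection by modularity maximization (stand-in for Louvain)
groups = community.greedy_modularity_communities(G)
print("communities:", [sorted(g) for g in groups])
```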

HOW DOES RAPIDMINER WORK

RapidMiner is open-source software. It is a solution for data mining, text mining, and predictive analytics, using a variety of descriptive and predictive techniques to give users insights so they can make the best decisions.
This application provides data mining and machine learning procedures including ETL (extraction, transformation, loading), data preprocessing, visualization, modeling, and evaluation. The data mining process is composed of nestable operators, described in XML, and created through a GUI. RapidMiner is written in Java and integrates the Weka data mining project and R statistics.

For the experiment, I use Premier League standings data. The aim is to find out who will be the Premier League champion based on points from match results up to week 15. The data cover the previous three seasons (2016-2019) and were taken from http://www.premierleague.com. Here is an example of the data.

Now let’s try to process the data using the Decision Tree operator in RapidMiner. The steps are as follows.

Open the RapidMiner application, then click File > New Process until the following display appears (here the author uses macOS):

The Process panel is where you add the operators used to process the data, because RapidMiner works on an input -> process -> output model. The output will be a diagram (a decision tree) that represents particular information.

To import the data, click the import file icon on the Repositories menu and then choose Import Data. Select the document that holds the data:

Click Next until you see the imported data in the RapidMiner application.

Click Next to add an annotation; if you do not want to add an annotation, click Next again.

In the variable type display, change the type of the Champs column, which contains binomial data (yes/no), to the label type. For Decision Trees, a label column must be present so that the data can be processed.

Click Next, then name the data; it is stored in the Local Repository.

Click the Finish button; the process then continues to the data processing stage.

  1. Select the Local Repository folder. Drag the data that was imported from Excel into the Process section.
  2. Look for the Decision Tree operator in the Operators menu. Drag it into the Process field, so the process looks like this:

Connect the Retrieve dataEPL operator to the Decision Tree operator, making sure no error message appears when they are connected. Also connect the output of the Decision Tree to the res (result) port on the right side. Here is the connected process.

When finished, press Play/Run; if there are no errors, the following results will appear:

With this data, we obtain information that:

Teams ranked above 1.5 when entering week 15 were, in each of the last three seasons, confirmed champions; teams ranked below 1.5 when entering week 15 in those same three seasons, however, could not be assured of becoming champions 🙂
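
RapidMiner builds the tree through its GUI; as a rough code equivalent, here is a scikit-learn sketch with made-up standings rows and hypothetical column names (rank_week15, points_week15, champs). The printed tree shows the same kind of “rank <= 1.5” split as the RapidMiner result.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical week-15 standings rows: rank and points at week 15,
# plus whether the team went on to be champion that season
data = pd.DataFrame({
    "rank_week15":   [1, 2, 3, 1, 4, 1],
    "points_week15": [40, 35, 31, 38, 27, 39],
    "champs":        ["yes", "no", "no", "yes", "no", "yes"],
})

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data[["rank_week15", "points_week15"]], data["champs"])

# The printed tree shows a split such as "rank_week15 <= 1.5"
print(export_text(tree, feature_names=["rank_week15", "points_week15"]))
```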

DATA IN GOJEK


“Helping to improve the structure of transportation in Indonesia, providing convenience for the public in carrying out daily tasks such as sending documents, shopping for daily needs, and using courier services, and contributing to the welfare of motorcycle taxi drivers in Jakarta and across Indonesia going forward.” GO-JEK’s stated goal, or vision, is to increase the welfare of society through the GO-JEK apps. The moment you focus on a goal, your goal becomes a magnet, pulling you and your resources toward it; the more focused your energies, the more power you generate. GO-JEK therefore builds products that help millions of Indonesians travel, shop, eat, and pay every day. The ‘Data Engineering’ (DE) team is responsible for creating reliable data infrastructure across all 18 GO-JEK products. In the process of realizing this vision, a few things stand out:
Scale: Data at GO-JEK does not grow linearly with the business but exponentially, as people start to build new products and record new activities on top of business growth.
Automation: Working at large scale makes it essential to automate everything from deployment to infrastructure; one enabler is the use of Big Data. This way, GO-JEK can push features faster without causing chaos and disruption in the production environment.
Product mindset: The DE team operates like an internal B2B SaaS company. It measures success with business metrics such as user adoption, retention, and revenue, or the cost savings generated per feature, and its users are the Product Managers, Developers, Data Scientists, and Analysts at GO-JEK.

GO-JEK’s data infrastructure enables easy and reliable publishing and consumption of raw and aggregated data. This unlocks several possibilities at GO-JEK, from AI-based allocation, fraud detection, and recommendations to critical real-time business reporting and monitoring. Given the real-time nature of GO-JEK’s business, the entire data infrastructure was built from the ground up around real-time data processing, unlike traditional batch processing architectures.

Moreover, GO-JEK is an on-demand application that must respond quickly. The pipeline from raw data collection to data visualization within the GO-JEK system must be integrated rapidly and accurately, so that data errors and bugs, which would affect decision making, application development, and customer satisfaction, do not occur.

HOW DOES GOJEK’S BIG DATA WORK 2

It’s also not in the GO-JEK engineering team’s nature to manually do things that can easily be automated :)

GO-JEK is an Indonesian start-up with many services: more than 20 across its Android and iOS applications. Naturally, a great deal of data enters the GO-JEK system from every one of these services, and that data is not yet well structured and integrated; it can be said to still be raw. GO-JEK’s data infrastructure enables the secure and reliable publishing and consumption of raw and aggregated data, which unlocks several possibilities, from AI-based allocation, fraud detection, and recommendations to critical real-time business reporting and monitoring. Given the real-time nature of GO-JEK’s business, the entire data infrastructure was built from the ground up around real-time data processing, unlike traditional batch processing architectures.

Raw data (customers, orders, drivers, etc.) enters the GO-JEK system through two main channels: Source Code Instrumentation and the Machine Statistics Daemon. The Stream Producer gives publishers the power to evolve their data schemas without breaking any consumers. Data is encoded as protocol buffers, a language-neutral, platform-neutral, extensible mechanism for serializing structured data. Stream producers publish the encoded data to fronting servers, which are well suited to high traffic. Each team is allocated one or more highly available fronting servers to avoid a single point of failure and data loss.

“We had to build an algorithm that can not only correctly identify separate clusters from each other, but also allocate a central point for each cluster that best represented where pickups frequently happen.”

GOJEK’S DATA ENGINEER

GO-JEK uses a clustering model to process the data, for example for POIs, or Points of Interest: places where a large number of pickups occur, such as malls, train stations, schools, universities, or large housing estates. In the previous version of the GO-JEK application, when users ordered GO-RIDE or GO-CAR at a POI, they had to coordinate with the driver to confirm the exact pickup point. GO-JEK wants to reduce this back-and-forth by letting users choose the ‘gate’, or entrance, they want; the allocated driver then heads straight to that point without the need to coordinate through calls or chat messages. But how does GO-JEK know where these gates, the convenient or popular pickup points, are?
One way would be to identify popular pickup points manually, but this is an unreliable approach given the large number of POIs GO-JEK wants to find and record. They therefore use clustering methods, with algorithms such as DBSCAN and KMeans, to make this possible automatically, as sketched below.
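
The exact GO-JEK pipeline is not public; the sketch below only illustrates the clustering step with scikit-learn, on made-up pickup coordinates scattered around three hypothetical gates.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(0)
# Made-up pickup coordinates around three hypothetical "gates"
gates = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 0.0]])
pickups = np.vstack([g + 0.05 * rng.standard_normal((50, 2)) for g in gates])

# KMeans: cluster centers become candidate pickup gates
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(pickups)
print("gate candidates:\n", km.cluster_centers_)

# DBSCAN: density-based alternative; needs no k, but yields no centers either
db = DBSCAN(eps=0.1, min_samples=5).fit(pickups)
print("DBSCAN found", len(set(db.labels_) - {-1}), "clusters")
```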

To measure and automate this, GO-JEK’s data engineers and data scientists must build an algorithm that can correctly identify popular pickup-point clusters the same way the human brain does intuitively when looking at plots manually.
To do this, they take advantage of the historical data they hold about exactly where customers are picked up around each POI. Below is an example of the locations where customers are picked up around Blok M Square, a mall in Jakarta.

From the picture, it is quite clear to the human eye that there are five areas around the mall where pickups occur most often. But how does GO-JEK make all this automatic? By using clustering algorithms such as DBSCAN and KMeans.
GO-JEK uses the KMeans method because it finds the number of clusters representing a POI’s dataset more accurately than DBSCAN.

To measure the accuracy of the model and methodology, GO-JEK uses quantitative models, such as statistics and other calculations, to assess the accuracy of both the method and the model itself, for example when using the KMeans algorithm in the clustering model.

However, KMeans has one major disadvantage: k, the number of clusters, must be supplied as an input to the model, so GO-JEK has to automate the process of determining an accurate k value for each POI. It does this through quantitative procedures: the Elbow Method to determine the optimal value of k; Silhouette Analysis over candidate numbers of clusters, choosing the k that maximizes the average silhouette score; and the Stupid Backoff method to ensure the chosen number of clusters is truly optimal, based on the significance of each cluster with respect to all pickups associated with the POI. After finding the number of clusters for a POI, GO-JEK sets each cluster center (as determined by KMeans) as a pickup point, or gate. But the work is not finished: the next step is to find names for these gates, and automating that brings its own series of challenges, requiring further methods so the gates can be implemented properly.
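
The first two checks can be sketched as follows (Stupid Backoff depends on details GO-JEK has not published, so it is omitted), again on made-up pickup data around three hypothetical gates.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Made-up pickup points around three hypothetical gates (as before)
gates = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 0.0]])
pickups = np.vstack([g + 0.05 * rng.standard_normal((50, 2)) for g in gates])

inertias, silhouettes = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pickups)
    inertias[k] = km.inertia_                      # elbow method input
    silhouettes[k] = silhouette_score(pickups, km.labels_)

best_k = max(silhouettes, key=silhouettes.get)     # maximize mean silhouette
print("inertia by k:", inertias)                   # look for the "elbow"
print("chosen k:", best_k)                         # expect 3 for this data
```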
