The producer used in the lab generates mock clickstream data for an e-commerce site. The scenario is that a user logs into the site, lands on the home page, browses the product catalog, visits individual product pages, and may add a product to the cart, continuing in this way until the user either creates an order and completes checkout or abandons the session. A sample session looks like this:
The producer generates the events in a user session in sequence, but runs multiple threads to simulate multiple users hitting the site concurrently; the number of threads can be specified via command-line parameters. The producer uses the Confluent Schema Registry and generates Avro-encoded events, so the location of the Schema Registry needs to be specified in a producer.properties_msk file. If none is provided, the default value is http://127.0.0.1:8081.

For each event generated, the producer assigns a global sequence number that is unique and sequential across all threads. The producer uses the UserId as the partition key for the events sent to Apache Kafka, which means that the events for the same user always go to the same Kafka partition and allows stateful, in-order processing of each user's events. However, the global sequence numbers are spread out across multiple partitions. When a consumer (in this case, a single consumer in a consumer group using the high-level Kafka consumer) reads from the topic, it receives messages from all partitions, so the global sequence numbers arrive in non-sequential order. Even so, the highest global sequence number received at any point can be used to determine how far behind the producer the consumer is, or whether the consumer has caught up.
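The properties file holds the connection settings described above. A minimal sketch, assuming the standard Kafka client and Confluent serializer property names; the broker endpoint is a placeholder, not a real cluster address:

```properties
# Hypothetical producer.properties_msk sketch -- endpoint values are placeholders
bootstrap.servers=b-1.mskcluster.example.amazonaws.com:9092
schema.registry.url=http://127.0.0.1:8081
```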
For example, the following invocation runs the producer against the topic ExampleTopic with the properties file at /tmp/kafka/producer.properties_msk, with the -nt flag setting the number of producer threads:

java -jar KafkaClickstreamClient-1.0-SNAPSHOT.jar -t ExampleTopic -pfp /tmp/kafka/producer.properties_msk -nt 8 -rf 300
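The global sequence numbering and lag check described above can be illustrated without Kafka itself. The following is a minimal, hypothetical sketch (not the lab's actual code): a shared atomic counter yields numbers that are unique and sequential across producer threads, and a consumer-side tracker keeps only the highest number seen, since messages from different partitions arrive out of order:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the lab's global sequence number idea.
public class GlobalSeqSketch {
    // Shared by all producer threads; getAndIncrement is atomic,
    // so every event gets a unique, sequential number.
    private static final AtomicLong globalSeq = new AtomicLong(0);

    static long nextSeq() {
        return globalSeq.getAndIncrement();
    }

    // Consumer side: sequence numbers arrive out of order across
    // partitions, so we only remember the maximum observed so far.
    static long highestSeen = -1;

    static void onMessage(long seq) {
        if (seq > highestSeen) highestSeen = seq;
    }

    // Approximate how far the consumer trails the producer.
    static long lag(long latestProduced) {
        return latestProduced - highestSeen;
    }

    public static void main(String[] args) {
        // Producer assigns sequence numbers 0..4.
        for (int i = 0; i < 5; i++) nextSeq();
        // Consumer receives some of them, out of order.
        for (long s : new long[] {2, 0, 4, 1}) onMessage(s);
        System.out.println("highest seen: " + highestSeen); // 4
        System.out.println("lag: " + lag(globalSeq.get() - 1)); // 0
    }
}
```

Tracking only the maximum works because the numbers are globally sequential: if the consumer has seen sequence number N, the producer has emitted at least N+1 events, so the gap between the producer's latest number and the consumer's maximum bounds how far behind the consumer is.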