Content farming in an e-commerce universe

Shashank Baravani
Mar 24, 2018

When you are a big e-commerce company with a catalogue larger than the population of Mumbai or Delhi, how do you advertise such a large volume of products to your audience? Here is one way it can be done. We called it content farming because it worked off a content federation model to create an index of references to landing pages across the site. It is not related in any way to the usual content farming technique, which is clickbait that lures users from a search results page to fictitious pages that exist for the sole purpose of making more advertising money.

Harvesting content

The company has departments responsible for merchandising, promotions and marketing. They craft interesting collections by tagging products, either manually or as part of an automated process. For example, Bestsellers, Offers, Discounts, Trending, etc. define product collections by their viewership, sales or business characteristics.

Understanding the profile of each member of the audience

Online users who shop frequently leave behind a very elaborate event trail of their journey, which is mined and digested to represent each consumer’s interest in products. This is called the User Affinity Model: it describes what the user is interested in and is expressed as a finite set of stores, products and collections. For example, one could say that a user is interested in “Android phones less than Rs. 25000”, “Apple watches”, “Levis T-shirts”, and so on. A user model might look like this: [(Store=T-Shirt, brand=Nike, score=11.23), (Store=Mobiles, OS=Android, Price-Range=0–10K, score=5.17), …].

This is accomplished by a MapReduce job that runs every day, mines the entire event trail of each of the 150+ million users, and produces an affinity model, or profile, that expresses each user’s interests in the format described above.
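To make the map and reduce steps concrete, here is a minimal single-process sketch in Python. It is not the actual job: the event fields (`user_id`, `type`, `attributes`), the event weights and the function names are all assumptions made for illustration, and the real scoring formula is not described here.

```python
from collections import defaultdict

# Hypothetical event weights; the real scoring formula is not described here.
EVENT_WEIGHTS = {"view": 1.0, "add_to_cart": 3.0, "purchase": 5.0}


def map_event(event):
    """Map step: emit ((user_id, interest), weight) for a single event."""
    # An interest is the set of attribute pairs, e.g. Store=Mobiles, OS=Android.
    interest = frozenset(event["attributes"].items())
    weight = EVENT_WEIGHTS.get(event["type"], 0.0)
    return (event["user_id"], interest), weight


def reduce_profiles(pairs):
    """Reduce step: sum weights per (user, interest), then sort by score."""
    totals = defaultdict(float)
    for key, weight in pairs:
        totals[key] += weight

    profiles = defaultdict(list)
    for (user_id, interest), score in totals.items():
        profiles[user_id].append((dict(interest), score))
    for interests in profiles.values():
        interests.sort(key=lambda item: item[1], reverse=True)
    return dict(profiles)


# Made-up event trail for one user.
events = [
    {"user_id": "u1", "type": "view", "attributes": {"Store": "Mobiles", "OS": "Android"}},
    {"user_id": "u1", "type": "view", "attributes": {"Store": "Mobiles", "OS": "Android"}},
    {"user_id": "u1", "type": "purchase", "attributes": {"Store": "T-Shirt", "brand": "Nike"}},
]
profiles = reduce_profiles(map_event(e) for e in events)
```

Here `profiles["u1"]` lists the Nike T-shirt interest first, since one purchase outweighs two views under the assumed weights.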

Understanding the common love of the audience

Once we understand each user’s profile, it’s easy to do the math and lay out the most prevalent interests across the entire populace. This is again expressed as a sorted list like the one above: [(Store=Shoes, brand=Reebok, score=11.23), (Store=Watches, Price-Range=0–10K, score=5.17), …].

This is accomplished by the Audience Interest Identification process, which involves yet another MapReduce job that runs after the previous one, performs simple math and aggregations on the user profiles, and produces a sorted set of queryable URLs, each representing a store or product collection along with its type (Offer, Bestseller, etc.).
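A rough sketch of that aggregation, again in plain Python rather than MapReduce, could look like the following. The base URL, the query parameter names and both function names are hypothetical, and "simple math" is assumed here to mean summing scores across users.

```python
from collections import defaultdict
from urllib.parse import urlencode

# Hypothetical endpoint; the real query URL format is not shown here.
BASE_URL = "https://example.com/collections"


def aggregate_interests(profiles):
    """Sum each interest's score across all users and sort descending."""
    totals = defaultdict(float)
    for interests in profiles.values():
        for attributes, score in interests:
            totals[frozenset(attributes.items())] += score
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(dict(interest), score) for interest, score in ranked]


def queryable_urls(audience_interests, collection_types):
    """Build one queryable URL per (interest, collection type) pair."""
    urls = []
    for attributes, _score in audience_interests:
        for collection_type in collection_types:
            params = sorted({**attributes, "type": collection_type}.items())
            urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


# Made-up user profiles in the format described earlier.
profiles = {
    "u1": [({"Store": "Shoes", "brand": "Reebok"}, 6.0)],
    "u2": [({"Store": "Shoes", "brand": "Reebok"}, 5.0), ({"Store": "Watches"}, 4.0)],
}
audience = aggregate_interests(profiles)
urls = queryable_urls(audience, ["Offer", "Bestseller"])
```

Because the audience interests are already sorted, the URLs for the most popular interest come first in `urls`.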

Content Farming

The next step is to retrieve and ingest all harvested content into a local search index. This index contains all bestsellers, offers, discounts, trending items, etc. matched against the common love of the user base, for example [Offer on Android Phones], [Bestselling Celio Shirts], [Trending watches]. Each piece of content is represented as an SEO-capable URL that points to a landing page on the website. These are called content cards, as they contain additional metadata apart from the landing URL. A content card would look like this:

URL = https://<DomainName>/mens-footwear/sports-shoes/reebok-brand/pr?sid=osp,cil,1c
ProductsCount = 120
RepresentativeProduct = <productId>
Type = Bestseller

All of this is accomplished by the Content Poller, built on Apache Storm. It consumes the set of queryable URLs published by the Audience Interest Identification process every day and then invokes each URL on a configurable periodic basis to gather the landing URLs and underlying metadata from the response. An important data point is the number of products: if it is found to be zero, the URL is discarded. Offers and discounts are queried more often since they tend to expire quickly, while bestsellers are queried less frequently. All in all, we had around 100K content cards indexed in the shared Elasticsearch cluster.
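The polling logic itself is simple enough to sketch without Storm. The snippet below is an illustration only: the interval values, the response fields (`landing_url`, `products_count`, `representative_product`) and the `fetch` callable are assumptions, and a plain dict stands in for the search index.

```python
from dataclasses import dataclass

# Hypothetical refresh intervals in minutes; all we know is that offers and
# discounts are polled more often than bestsellers.
POLL_INTERVAL_MINUTES = {"Offer": 15, "Discount": 15, "Trending": 60, "Bestseller": 360}


@dataclass
class ContentCard:
    url: str
    products_count: int
    representative_product: str
    card_type: str


def is_due(card_type, minutes_since_last_poll):
    """Decide whether a queryable URL of this type should be polled again."""
    return minutes_since_last_poll >= POLL_INTERVAL_MINUTES.get(card_type, 60)


def poll(queryable_urls, fetch, index):
    """Refresh the index; `fetch` is a stand-in for the HTTP call to the site."""
    for query_url, card_type in queryable_urls:
        response = fetch(query_url)
        if response["products_count"] == 0:
            index.pop(query_url, None)  # nothing sellable: discard the card
            continue
        index[query_url] = ContentCard(
            url=response["landing_url"],
            products_count=response["products_count"],
            representative_product=response["representative_product"],
            card_type=card_type,
        )
    return index


def fake_fetch(query_url):
    """Canned responses used only for this example."""
    count = 0 if "expired" in query_url else 120
    return {"landing_url": query_url, "products_count": count, "representative_product": "P1"}


index = poll([("/q/reebok", "Bestseller"), ("/q/expired-offer", "Offer")], fake_fetch, {})
```

Only the Reebok card survives in `index`; the expired offer returns zero products and is dropped.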

Content Serving

Content + Explicit Intent = Satisfaction
Content + Implicit Intent = Delight
Content + Subliminal Intent = Serendipity

Once we have the content and the user’s profile, the next step is to marry them and produce an experience that guides the consumer’s discovery process, especially when their intent is implicit or unknown. This matters most on the home page, which is the most valuable real estate and where the consumer begins their journey. In simple terms, we lay out content for the consumer ordered from their most recent and granular interests down to their historical and broader tastes.

This is the responsibility of Content Providers, whose job is to query the user’s recent and historical profiles from the fast-serving key-value data store (Aerospike in this case). In parallel, they retrieve the content cards from the index and perform an intelligent selection to determine what’s best for the user. Additionally, they ensure that the content is diverse, non-repetitive and backed by sellable products at the time of surfacing.
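As an illustration of what such a selection might involve, here is a small sketch. The card fields, the matching rule and the diversity rule (at most one card per store and type) are assumptions, not the actual selection logic.

```python
def matches(card, interest):
    """A card matches when it carries every attribute of the interest."""
    return all(card["attributes"].get(k) == v for k, v in interest.items())


def select_cards(recent, historical, cards, limit=5):
    """Pick cards for recent interests first, then historical ones."""
    selected, seen_urls, seen_slots = [], set(), set()
    for interest in recent + historical:
        for card in cards:
            # Skip cards with nothing to sell or that do not fit this interest.
            if card["products_count"] <= 0 or not matches(card, interest):
                continue
            # Diversity rule (an assumption): one card per (store, type) pair.
            slot = (card["attributes"].get("Store"), card["type"])
            if card["url"] in seen_urls or slot in seen_slots:
                continue
            selected.append(card)
            seen_urls.add(card["url"])
            seen_slots.add(slot)
            if len(selected) == limit:
                return selected
    return selected


# Made-up content cards and user interests.
cards = [
    {"url": "/a", "type": "Offer", "products_count": 40, "attributes": {"Store": "Mobiles", "OS": "Android"}},
    {"url": "/b", "type": "Offer", "products_count": 10, "attributes": {"Store": "Mobiles", "OS": "iOS"}},
    {"url": "/c", "type": "Bestseller", "products_count": 0, "attributes": {"Store": "Watches"}},
    {"url": "/d", "type": "Trending", "products_count": 25, "attributes": {"Store": "Watches"}},
]
recent = [{"Store": "Mobiles", "OS": "Android"}]
historical = [{"Store": "Mobiles"}, {"Store": "Watches"}]
chosen = [card["url"] for card in select_cards(recent, historical, cards)]
```

In this example `/b` is skipped as a second mobile offer and `/c` is skipped because it has no sellable products, leaving `/a` and `/d`.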

Content Ranking

A small note on content ranking. Usually there are hundreds of candidate content cards vying to be displayed on the home page within a limited real estate of 4 or 5 carousels. Content cards are ranked by various performance parameters (such as clicks, conversions, available products, etc.) as well as the user’s own recent and historical preferences before they are sent back to the UI for rendering.
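One simple way to combine these signals is a weighted sum, sketched below. The weights, the field names and the normalisation (affinity scaled to 0..1, availability capped at 100 products) are purely illustrative assumptions; the actual ranking function is not described here.

```python
# Hypothetical weights; the signals are named above, but not how they combine.
WEIGHTS = {"ctr": 0.4, "conversion": 0.3, "availability": 0.1, "affinity": 0.2}


def score(card, affinity):
    """Blend card performance with the user's affinity for the card's store."""
    ctr = card["clicks"] / max(card["impressions"], 1)
    conversion = card["orders"] / max(card["clicks"], 1)
    availability = min(card["products_count"] / 100, 1.0)
    return (
        WEIGHTS["ctr"] * ctr
        + WEIGHTS["conversion"] * conversion
        + WEIGHTS["availability"] * availability
        + WEIGHTS["affinity"] * affinity.get(card["store"], 0.0)
    )


def rank_cards(cards, affinity, slots):
    """Return the top `slots` cards, best first."""
    return sorted(cards, key=lambda c: score(c, affinity), reverse=True)[:slots]


# Made-up performance numbers and a user affinity scaled to 0..1.
cards = [
    {"url": "/a", "store": "Mobiles", "impressions": 1000, "clicks": 100, "orders": 10, "products_count": 120},
    {"url": "/b", "store": "Watches", "impressions": 1000, "clicks": 50, "orders": 5, "products_count": 50},
    {"url": "/c", "store": "Shoes", "impressions": 1000, "clicks": 200, "orders": 10, "products_count": 80},
]
affinity = {"Mobiles": 0.2, "Watches": 1.0}
top = [card["url"] for card in rank_cards(cards, affinity, slots=2)]
```

With these numbers the watches card wins despite weaker click performance, because the user’s strong affinity for watches outweighs it.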
