A friendly, step-by-step guide to data engineering, Elasticsearch, and the data lake — written so a curious middle schooler can follow along.
By the end of this handbook you will be able to explain what a data engineer does, walk through an ETL pipeline, and load and search a data lake with Elasticsearch.
Imagine your school cafeteria on pizza day. The food does not magically appear on your tray. Someone has to order the ingredients, store them in fridges, cook them, and serve them neatly. Data engineering is exactly the same — but instead of pizza, the ingredient is data.
A data engineer is the cafeteria worker of the digital world. They build the pipelines that grab raw data (clicks, photos, weather readings, video views) and prepare it so that scientists, students, or apps can actually use it.
Move: Get data from where it lives (a phone, a sensor, a website) to where it can be analyzed.
Clean: Fix typos, remove duplicates, drop empty rows. Real data is messy — like a teenager's room.
Store: Put it in a place (a database or a data warehouse) where finding it is fast.
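Those three jobs can be sketched as one tiny pipeline. The Go snippet below is a toy illustration: the `Record` type and the `extract`, `clean`, and `store` helpers are invented for this example, not part of any real library.

```go
package main

import (
	"fmt"
	"strings"
)

// A raw record as it might arrive from an app — messy on purpose.
type Record struct {
	Name  string
	Email string
}

// extract simulates grabbing data from where it lives.
func extract() []Record {
	return []Record{
		{Name: "  alex  ", Email: "alex@example.com"},
		{Name: "jamie", Email: "jamie@example.com"},
		{Name: "jamie", Email: "jamie@example.com"}, // duplicate!
	}
}

// clean trims whitespace and removes duplicate rows.
func clean(in []Record) []Record {
	seen := map[string]bool{}
	var out []Record
	for _, r := range in {
		r.Name = strings.TrimSpace(r.Name)
		if !seen[r.Email] {
			seen[r.Email] = true
			out = append(out, r)
		}
	}
	return out
}

// store would normally write to a database; here we just print.
func store(records []Record) {
	for _, r := range records {
		fmt.Printf("%s <%s>\n", r.Name, r.Email)
	}
}

func main() {
	store(clean(extract())) // move → clean → store
}
```

Real pipelines do the same three steps, just with millions of rows and sturdier error handling.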
Data engineers juggle a lot of tools — but don't worry, you don't need them all. We'll focus on Elasticsearch (with a little Go and Java).
Before we touch a single line of code, we need to know what we're dealing with. Data comes in three flavors, kind of like ice cream.
Structured: Lives in neat rows and columns, like a spreadsheet. Every row has the same shape.
Example: a list of students with name, age, grade.
Semi-structured: Has some shape but is bendy. Things like JSON or XML.
Example: a tweet with text + hashtags + author.
Unstructured: Total chaos. Photos, videos, free-text emails, voice notes. No rows. No columns.
Example: 1,000 selfies. 1,000 PDFs.
The trick is to add a tiny bit of structure. We pull out the important parts of unstructured data and turn them into a JSON document. Watch:
```
// Original (unstructured) email
"Hi! I lost my red water bottle in room 204 on Tuesday."

// After data engineering — semi-structured JSON
{
  "item":     "water bottle",
  "color":    "red",
  "location": "room 204",
  "day":      "Tuesday",
  "status":   "lost"
}
```
Now a computer can search, count, and group all the lost items in a way it never could with raw sentences.
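For instance, once the emails are JSON documents, counting lost items by color takes only a few lines of Go. This is a toy sketch; `LostItem` and `countByColor` are names invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LostItem mirrors the JSON documents our pipeline produces.
type LostItem struct {
	Item  string `json:"item"`
	Color string `json:"color"`
}

// countByColor groups semi-structured lost-item docs by color.
func countByColor(docs []string) map[string]int {
	counts := map[string]int{}
	for _, d := range docs {
		var li LostItem
		if err := json.Unmarshal([]byte(d), &li); err != nil {
			continue // skip malformed documents
		}
		counts[li.Color]++
	}
	return counts
}

func main() {
	docs := []string{
		`{"item":"water bottle","color":"red"}`,
		`{"item":"hoodie","color":"red"}`,
		`{"item":"pencil case","color":"blue"}`,
	}
	fmt.Println(countByColor(docs)) // map[blue:1 red:2]
}
```

Try doing that on three raw sentences: you would be stuck writing fragile text-matching rules.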
A folder of vacation photos is what kind of data?
ETL stands for three steps: Extract → Transform → Load. It is basically cooking, but for data.
Extract: Grab raw ingredients from the store. In data terms: pull data from APIs, databases, files, sensors.
Transform: Wash, chop, cook, season the ingredients. In data terms: clean, reformat, combine, validate.
Load: Plate it up and serve. In data terms: save into a database, warehouse, or data lake.
Pretend we run a lemonade stand and our point-of-sale app gives us a CSV file. We want it in our data lake, but cleaner.
```
# 1. EXTRACT — read the file
date,customer,price
2026-04-29,Alex, $4.50
2026-04-29,jamie ,$3.00
2026-04-29,Pat,FREE      # Pat is the owner's kid

# 2. TRANSFORM — clean it up
#  - strip whitespace from names
#  - capitalize first letter
#  - parse "$4.50" → 4.50 (a number)
#  - drop "FREE" rows (not a real sale)

# 3. LOAD — write JSON into the data lake
{ "date": "2026-04-29", "customer": "Alex", "price": 4.50 }
{ "date": "2026-04-29", "customer": "Jamie", "price": 3.00 }
```
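Here is one possible Go version of that pipeline. The `transform` helper is invented for this example, and the CSV rows are hard-coded instead of read from a real file.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type Sale struct {
	Date     string  `json:"date"`
	Customer string  `json:"customer"`
	Price    float64 `json:"price"`
}

// transform cleans one CSV row; ok is false for rows we drop.
func transform(row string) (Sale, bool) {
	parts := strings.Split(row, ",")
	if len(parts) != 3 {
		return Sale{}, false
	}
	price := strings.TrimSpace(parts[2])
	if price == "FREE" { // not a real sale
		return Sale{}, false
	}
	var p float64
	if _, err := fmt.Sscanf(strings.TrimPrefix(price, "$"), "%f", &p); err != nil {
		return Sale{}, false
	}
	name := strings.TrimSpace(parts[1])
	if name == "" {
		return Sale{}, false
	}
	name = strings.ToUpper(name[:1]) + name[1:] // capitalize first letter
	return Sale{Date: strings.TrimSpace(parts[0]), Customer: name, Price: p}, true
}

func main() {
	// EXTRACT: pretend these rows came from the point-of-sale CSV.
	rows := []string{
		"2026-04-29,Alex, $4.50",
		"2026-04-29,jamie ,$3.00",
		"2026-04-29,Pat,FREE",
	}
	for _, row := range rows {
		if sale, ok := transform(row); ok { // TRANSFORM
			out, _ := json.Marshal(sale)
			fmt.Println(string(out)) // LOAD: one JSON line per sale
		}
	}
}
```

Notice that Pat's "FREE" row silently disappears during the transform step, exactly as planned.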
A data lake is a giant storage area where companies dump all their data — structured, semi-structured, and unstructured — in its raw form. Think of it as a real lake: water flows in from rivers (apps, sensors, websites). It is deep, it is wide, and you can fish out whatever you need later.
| Feature | Data Lake | Data Warehouse |
|---|---|---|
| Data type | Anything (raw) | Cleaned, structured only |
| When to clean | Later (read time) | Before storing (write time) |
| Best for | Data scientists, ML, exploration | Business reports & dashboards |
| Cost | Cheap | More expensive |
| Vibe | "Throw it in, sort it later" | "Everything in its place" |
Raw zone: Exactly what came in. Unmodified. Like a snapshot.
Cleaned zone: Validated, deduplicated, normalized. Easy to use.
Curated zone: Aggregated and joined. Polished tables for dashboards.
Elasticsearch is like Google… for your own data. It can search through millions of documents in milliseconds because it builds a special structure called an inverted index.
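At its heart, an inverted index is just a map from each word to the list of documents that contain it. Here is a toy Go version (nothing like Elasticsearch's real implementation, but the same core idea):

```go
package main

import (
	"fmt"
	"strings"
)

// buildIndex maps each lowercase word to the IDs of documents containing it.
func buildIndex(docs []string) map[string][]int {
	index := map[string][]int{}
	for id, doc := range docs {
		seen := map[string]bool{}
		for _, word := range strings.Fields(strings.ToLower(doc)) {
			if !seen[word] { // record each document once per word
				seen[word] = true
				index[word] = append(index[word], id)
			}
		}
	}
	return index
}

func main() {
	docs := []string{
		"lost red water bottle",
		"found blue hoodie",
		"red pencil case in the gym",
	}
	index := buildIndex(docs)
	// Which documents mention "red"? One map lookup, no scanning.
	fmt.Println(index["red"]) // → [0 2]
}
```

Searching becomes a lookup instead of reading every document, which is why it stays fast at millions of documents.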
You communicate with Elasticsearch using HTTP requests — the same way your browser asks for web pages. Every action is a verb (GET, POST, PUT, DELETE) plus a URL.
```
# Create an index
PUT /students

# Add (index) one document
POST /students/_doc
{ "name": "Riley", "grade": 7, "hobbies": ["chess", "art"] }

# Search for any 7th grader who likes art
GET /students/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "hobbies": "art" } },
        { "term":  { "grade": 7 } }
      ]
    }
  }
}
```
Here are the endpoints you will use 90% of the time.
GET /_cluster/health: Is the cluster alive? Returns green / yellow / red — a stoplight for your servers.
PUT /my-index: Create a new index, optionally with a mapping (schema).
POST /my-index/_doc: Index one document. Elasticsearch picks an ID for you.
POST /_bulk: Index thousands of documents in one trip. The fastest way to load a data lake.
GET /my-index/_search: Search using Query DSL. Match, term, range, fuzzy, geo — all live here.
DELETE /my-index/_doc/<id>: Remove one document. Use carefully (no Ctrl-Z!).
```
POST /lemonade-sales/_bulk
{ "index": { "_id": "1" } }
{ "date": "2026-04-29", "customer": "Alex", "price": 4.5 }
{ "index": { "_id": "2" } }
{ "date": "2026-04-29", "customer": "Jamie", "price": 3.0 }

# Each line alternates: action, document, action, document...
```
Yes — one JSON object per line, no commas between them. It is called NDJSON.
Go (a.k.a. Golang) is a programming language built at Google for fast network code. Elastic publishes an official Go client. Let's use it to load lemonade sales.
```shell
go mod init lemonade
go get github.com/elastic/go-elasticsearch/v8@latest
```
```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"log"

	es "github.com/elastic/go-elasticsearch/v8"
)

type Sale struct {
	Date     string  `json:"date"`
	Customer string  `json:"customer"`
	Price    float64 `json:"price"`
}

func main() {
	cfg := es.Config{
		Addresses: []string{"http://localhost:9200"},
		APIKey:    "YOUR_API_KEY", // from Kibana
	}
	client, err := es.NewClient(cfg)
	if err != nil {
		log.Fatal(err)
	}

	sale := Sale{Date: "2026-04-29", Customer: "Alex", Price: 4.5}
	body, err := json.Marshal(sale)
	if err != nil {
		log.Fatal(err)
	}

	// The index name is the first argument; options come after the body.
	res, err := client.Index(
		"lemonade-sales",
		bytes.NewReader(body),
		client.Index.WithContext(context.Background()),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	log.Println("Indexed!", res.Status())
}
```
```go
// Needs "io", "os", and "strings" added to the imports above.
query := `{ "query": { "match": { "customer": "Alex" } } }`
res, err := client.Search(
	client.Search.WithIndex("lemonade-sales"),
	client.Search.WithBody(strings.NewReader(query)),
)
if err != nil {
	log.Fatal(err)
}
defer res.Body.Close()
io.Copy(os.Stdout, res.Body) // print the raw JSON results
```
For loading thousands of documents, use the helper esutil.NewBulkIndexer (from the go-elasticsearch esutil package). It batches and retries automatically.
Java is one of the oldest and most-used languages in big enterprises. Elastic provides the Elasticsearch Java API Client — a modern, type-safe client.
In `pom.xml`:

```xml
<dependency>
  <groupId>co.elastic.clients</groupId>
  <artifactId>elasticsearch-java</artifactId>
  <version>8.13.0</version>
</dependency>
```
```java
import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.elasticsearch._types.Refresh;
import co.elastic.clients.json.jackson.JacksonJsonpMapper;
import co.elastic.clients.transport.rest_client.RestClientTransport;
import org.elasticsearch.client.RestClient;
import org.apache.http.HttpHost;

public class LemonadeLoader {
    public record Sale(String date, String customer, double price) {}

    public static void main(String[] args) throws Exception {
        RestClient rest = RestClient
            .builder(new HttpHost("localhost", 9200, "http"))
            .build();
        RestClientTransport transport =
            new RestClientTransport(rest, new JacksonJsonpMapper());
        ElasticsearchClient client = new ElasticsearchClient(transport);

        Sale sale = new Sale("2026-04-29", "Alex", 4.5);
        client.index(i -> i
            .index("lemonade-sales")
            .document(sale)
            .refresh(Refresh.WaitFor)
        );
        System.out.println("Indexed sale!");
    }
}
```
```java
SearchResponse<Sale> response = client.search(s -> s
        .index("lemonade-sales")
        .query(q -> q.match(m -> m
            .field("customer")
            .query("Alex")
        )),
    Sale.class
);
response.hits().hits().forEach(hit -> System.out.println(hit.source()));
```
For loading many documents in Java, use BulkIngester from the same package — it auto-flushes when the buffer fills or a timeout hits.
Time to put it all together. Here is the architecture of a tiny but real data lake using Elasticsearch as the searchable layer.
Run Elasticsearch locally:

```shell
docker run -p 9200:9200 docker.elastic.co/elasticsearch/elasticsearch:8.13.0
```

— or use Elastic Cloud's free trial. Then create the index with a mapping:

```
PUT /lemonade-sales
{
  "mappings": {
    "properties": {
      "date":     { "type": "date" },
      "customer": { "type": "keyword" },
      "price":    { "type": "float" },
      "location": { "type": "geo_point" }
    }
  }
}
```
Which API endpoint loads thousands of documents fastest?
Hand-picked links to keep your quest going.
Elasticsearch API reference: Every endpoint, every parameter — the source of truth.
Getting Started Guide: The friendliest first 30 minutes with Elasticsearch.
Go client docs: Official go-elasticsearch client guide.
Java API Client docs: Modern, type-safe Java client by Elastic.
Kibana Guide: Beautiful dashboards on top of Elasticsearch.
Query DSL: All the ways to ask Elasticsearch questions.
Elastic's official channel: Tutorials, talks, and demos straight from the team.
TechWorld with Nana: Beginner-friendly DevOps & data infrastructure videos.
codebasics: Approachable data engineering & SQL playlists.
IBM Technology: Whiteboard explainers on Data Lakes vs Warehouses.
Seattle Data Guy: Career and concept videos from a working data engineer.
freeCodeCamp: Full-length free courses on Go, Java, ETL, and more.
Elastic Cloud: Spin up a real cluster in minutes.
A Tour of Go: Interactive Go basics in your browser.
dev.java: Oracle's free Java learning hub.
Designing Data-Intensive Applications by Martin Kleppmann. The bible of modern data systems.
Fundamentals of Data Engineering by Reis & Housley. A friendlier overview of the whole field.