Zhang Gong is a programmer, mainly doing Java development, once went to a software company to interview for a software development position, the interviewer asked him two questions, one of which is like this:

Tell me the difference between a left join and a free semi join?

For the left join Zhang Gong is used more often in writing SQL in normal times, but for the left semi join Zhang Gong is rarely used, so it is impossible to say why it is coming.

The interviewer saw that Zhang Gong could not answer, because the position for which Zhang Gong applied had higher requirements for the level of writing SQL, the interviewer did not continue to ask questions, and asked him to go back first and wait for the notice.

It's so heart-wrenching! A question was not answered and he was called back to wait for the notification.

Interviewer: Tell me the difference between a left join and a free semi join?

Let's see what's the difference between a lett join and a free semi join in hive.

For the loft join we usually use, we will not elaborate on it, focus on the next loft semi join.

left semi join

As it can be understood, LEFT SEMI JOIN is a more efficient implementation of IN/EXISTS subqueries.

example

select 
    a.key,
    a.value
from table_name_a a
where a.key in (select b.key from table_name_b b)

We can rewrite this sql as

select 
    a.key,
    a.value
from table_name_a  a
left semi join table_name_b on(a.key=b.key)

Left semi join feature

The left semi join restriction, the table on the right of the join clause can only be filtered in the on clause, and filtering in the where clause, select clause, or elsewhere is not advisable.
Because the loft semi join is to pass only the join key of the table to the map stage, the result of the last select in the loft semi join is only allowed to appear in the left table.
Because left semi join is an in(keySet) relationship, when a duplicate record is encountered in the right table, the left table will skip and join will be traversed all the time. This causes that if the right table has duplicate values, the left semi join produces only one, and the join produces many records, which will also cause the performance of the left semi join to be higher.

The above description may not be easy to understand, but it is still very difficult to understand, let's take an example.

Start by creating two tables (the Student Table and the Scores Table) and initialize the data with these two tables.

--Student table

create table test.tb_student (
  id int comment 'id',
  name string comment '姓名',
 age int comment '年龄'
)

-- Initialize the data

INSERT INTO test.tb_student VALUES ('1', '张三', '7');
INSERT INTO test.tb_student VALUES ('2', '李四', '7');
INSERT INTO test.tb_student VALUES ('3', '王五', '7');
INSERT INTO test.tb_student VALUES ('4', '小爱', '7');

Query the data for the test.tb_student table under the table

select * from test.tb_student

--Score table

create table test.course (
 id int comment '学生id',
name string comment '课程名称',
score string comment '分数'
)

INSERT INTO test.course VALUES ('1', '语文','90');
INSERT INTO test.course VALUES ('1', '数学','100');
INSERT INTO test.course VALUES ('1', '英语','90');

INSERT INTO test.course VALUES ('2', '语文','90');
INSERT INTO test.course VALUES ('2', '数学','90');
INSERT INTO test.course VALUES ('2', '英语','100');

INSERT INTO test.course VALUES ('3', '语文','100');
INSERT INTO test.course VALUES ('3', '数学','100');
INSERT INTO test.course VALUES ('3', '英语','99');

INSERT INTO test.course VALUES ('4', '语文','100');
INSERT INTO test.course VALUES ('4', '数学','100');
INSERT INTO test.course VALUES ('4', '英语','100');

Query the data in the test.course table

select * from test.course

With the above operation, we have successfully created two tables and initialized their data.

Let's take a look at what kind of data is queried after using the left semi join and left join operations?

-- left semi join

select * from test.tb_student a
left semi join test.course b on(a.id=b.id)

Query results:

If you are careful, you will find that this is not the data of the test.tb_student table we queried earlier? Yes, it is, because only the fields of the t1 (left table) table are allowed in the result of the select of the loft semi join

The SQL just now can actually be equivalent

select * from test.tb_student a where a.id in (select id from test.course)

and

select * from test.tb_student a where exists (select 1 from test.course b where a.id=b.id)

The results of their execution are consistent.

Let's take a look at what the result of querying with let's join will be.

--left join

select * from test.tb_student a
left join test.course b on a.id=b.id

From the execution results of the above two, we can clearly see the difference between the left semi join and the left join, which also proves the characteristics of the left semi join mentioned above.

Left semi join is to pass only the table's join key to the map stage, so the result of the last select in the loft semi join is only allowed to appear in the left table.
Left semi join encounters duplicate records in the right table, and the left table is skipped

During the interview, if the interviewer asks such a detailed question as the difference between the left join and the loft semi join, the more friendly answer may wish to refer to this:

The table on the right in the left semi join clause can only be filtered in the on clause
The result of the last select in the left semi join is only allowed in the left table.
Because left semi join is an in(keySet) relationship, when a duplicate record is encountered in the right table, the left table will skip and join will be traversed all the time.

If you can answer the characteristics of the left semi join, it shows that you have full development experience in this area, and I believe you can add points to this interview.

summary:

The difference in how duplicate keys are handled in the right table: because the left semi join is an in(keySet) relationship, when a duplicate record is encountered in the right table, the left table will skip and join on will always traverse.
The result of the last select in the left semi join is only allowed to appear in the left table, because only the join key in the right table participates in the association calculation, while join on default is that the entire relational model participates in the calculation.

Due to the author's limited knowledge and level, errors and omissions in the text are inevitable, if there are deficiencies, welcome to communicate.

#Programmer ##职场头条 #

Interviewer: Tell me the difference between a left join and a free semi join?

left semi join